1 Introduction

BayPass is a genome–environment association (GEA) method developed by Gautier (2015). Its models explicitly account for (and can estimate) the covariance structure between population allele frequencies that arises from the shared history of populations. BayPass provides three models:
- Core model: An FST-scan method that does not incorporate climatic or environmental variables.
- Auxiliary covariate model: A GEA method that does not account for population structure.
- Standard covariate model: A GEA method that accounts for population structure.

We will use the Standard Covariate Model to perform outlier detection. This model calculates a scaled covariance matrix between populations, which is used as a covariate to account for population structure. It assesses the extent to which each marker is linearly associated with the covariates (the covariance matrix and the climatic/environmental variables).

The regression coefficient for each SNP with the covariates is estimated using either MCMC (not detailed here) or the IS (importance sampling) approximation. The IS approach also allows estimation of the Bayes factor, which evaluates support for the association of each SNP i with covariate k, i.e., comparing the model with association (βᵢₖ ≠ 0) to the null model (βᵢₖ = 0) (BayPass manual).

Steps to perform outlier detection with BayPass:
- Prepare genomic data (with and without MAF filtering).
- Prepare climate/environmental data in the required format.
- Run the Standard Covariate Model with IS on Ubuntu.
- Analyze the results following the BayPass manual.

#meta data pop
meta_data_pop <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Populations/taxus_sample_29pop.csv",h=T,sep=";",dec=",")
#meta data indiv
meta_data_vcf=read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Samples/samples_taxus_baccata_adapcon_gentree.csv",h=T,sep=";",dec=",")

2 Data

The genomic data must be formatted with SNPs in rows, with one row per SNP, and populations in columns (the number of columns should be twice the number of populations), with two columns per population (one for each allele of a SNP). The genomic information is encoded in allele counts at the population level, as shown below:
POP1 POP2 POP3 POP4 POP5 POP6
SNP 1: 71 8 115 0 61 36 51 39 10 91 69 58
SNP 2: 82 0 91 0 84 14 24 57 28 80 18 80

We will need to create two genomic files in this format: one for the GEA and another to calculate the covariance matrix to account for the population structure. The covariance matrix file will not be filtered for Minor Allele Count (MAC), as MAC is important for assessing population structure.

2.1 Format of the genomic data for the GEA analysis (corrected for MAC)

The genomic matrix consists of the imputed and MAC-corrected dataset (for GEA) or the dataset without MAC correction (for the omega matrix), comprising 475 individuals and 8,616 SNPs.

#genomic data for outlier detection
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/Gen_matrix_imp_T_Adapcon_Gentree_475_8616.Rdata")
genomic_data_maf_c <-Gen_matrix_imp_T_Adapcon_Gentree_475_8616

#genomic data with MAF (non corrected) for covariance matrix
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/GEA/Data_geno_no_maf_c_8252SNP_452IND.Rdata")

genomic_data_no_maf_c <- Data_geno_no_maf_c_8252SNP_452IND
genomic_data_maf_c_ID <- rownames_to_column(genomic_data_maf_c)
#name of VCF_ID to merge 
names(genomic_data_maf_c_ID)[names(genomic_data_maf_c_ID) == 'rowname'] <- 'VCF_ID'

#add the population information
genomic_data_MAF_pop <- data.frame(merge(meta_data_vcf[,c(1,3)],genomic_data_maf_c_ID, "VCF_ID"))
#This function enables to calculate the allelic count of the 1st allele of SNP from the genotypic data in format: 0, 1, 2. 
reformat_genotype <- function(allele) {
  allele1 <- case_when(
    allele == " 0" ~ 2,
    allele == " 1" ~ 1,
    allele == " 2" ~ 0,
    allele == "0" ~ 2,
    allele == "1" ~ 1,
    allele == "2" ~ 0,
    TRUE ~ NA_real_
  )
  return(allele1)
}
#Apply the reformat_genotype function to SNP columns and calculate Allele2. Allele1 and 2 in two different dataframe to merge them more easily after. 

#allele1 
df_allele1 <- genomic_data_MAF_pop %>%
  mutate(across(starts_with("Tbac"), ~ reformat_genotype(.))) 

  
#allele 2
df_allele2 <- df_allele1 %>%
  mutate_at(vars(starts_with("Tbac")), ~ 2 - .)#to calculate the allele2, we just did 2- the count of allele1 for each individual for each snp. 


# Group by Population and summarize allele counts
#allele1
allele_counts_allele1 <- df_allele1 %>%
  group_by(Population) %>%
  summarize(
    across(starts_with("Tbac"), ~ sum(.)),
  )

#allele2
allele_counts_allele2 <- df_allele2 %>%
  group_by(Population) %>%
  summarize(
    across(starts_with("Tbac"), ~ sum(.)),
  )
#rename allele_counts_2 to merge them with allele1
allele_counts_allele2$Population <- paste0(allele_counts_allele2$Population,"_Allele_2")

#final dataset with both allele 1 and 2 

final_dtf <- rbind(allele_counts_allele1,allele_counts_allele2);row.names(final_dtf) <- final_dtf$Population

#order the population to have the format where allele1 and allele 2 for each pop are beside
final_r_order <- final_dtf[order(row.names(final_dtf)), ]

#allele in row and population*2 in columns side by side
data_allele_count_BAYPASS_MAF_c <- data.frame(t(final_r_order))

#export the data in txt
write.table(x=data_allele_count_BAYPASS_MAF_c,
  file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/data_allele_count_BAYPASS_MAF_c.txt", 
            sep = " ",
            row.names = F, 
            col.names = F) 

The data_allele_count_BAYPASS_MAF_c file contains the genotypic data to run the BAYPASS standard covariate model with importance sampling (IS).

2.2 Format of the genomic data for the covariance matrix (non corrected for MAC)

genomic_data_maf_no_c_ID <- rownames_to_column(genomic_data_no_maf_c)
#name of VCF_ID to merge 
names(genomic_data_maf_no_c_ID)[names(genomic_data_maf_no_c_ID) == 'rowname'] <- 'VCF_ID'

#add the population information
genomic_data_no_MAF_pop <- data.frame(merge(meta_data_vcf[,c(1,3)],genomic_data_maf_no_c_ID, "VCF_ID"))
# Apply the reformat_genotype function to SNP columns and calculate Allele2. Allele1 and 2 in two different dataframe to merge them more easily after. 

#allele1 
df_allele1_no_mac <- genomic_data_no_MAF_pop %>%
  mutate(across(starts_with("Tbac"), ~ reformat_genotype(.))) 

#allele 2
df_allele2_no_mac <- df_allele1_no_mac %>%
  mutate_at(vars(starts_with("Tbac")), ~ 2 - .)#to calculate the allele2, we just did 2- the count of allele1 for each individual for each snp. 

# Group by Population and summarize allele counts

#allele1
allele_counts_allele1_no_mac <- df_allele1_no_mac %>%
  group_by(Population) %>%
  summarize(
    across(starts_with("Tbac"), ~ sum(.,na.rm = TRUE)),
  )

#allele2
allele_counts_allele2_no_mac <- df_allele2_no_mac %>%
  group_by(Population) %>%
  summarize(
    across(starts_with("Tbac"), ~ sum(.,na.rm = TRUE)),
  )
#rename allele_counts_2 to merge them with allele1
allele_counts_allele2_no_mac$Population <- paste0(allele_counts_allele2_no_mac$Population,"_Allele_2")

#final dataset with both allele 1 and 2 

final_dtf_no_mac <- rbind(allele_counts_allele1_no_mac,allele_counts_allele2_no_mac);row.names(final_dtf_no_mac) <- final_dtf_no_mac$Population

#order the population to have the format where allele1 and allele 2 for each pop are beside
final_r_order_no_mac <- final_dtf_no_mac[order(row.names(final_dtf_no_mac)), ]

#allele in row and population*2 in columns side by side
data_allele_count_BAYPASS_MAF_no_c <- data.frame(t(final_r_order_no_mac))

#export the data in txt
write.table(x=data_allele_count_BAYPASS_MAF_no_c,
  file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/data_allele_count_BAYPASS_MAF_no_c.txt", 
            sep = " ",
            row.names = F, 
            col.names = F) 

data_allele_count_BAYPASS_MAF_no_c contains the genomic data used to run the core model of BAYPASS to calculate the omega matrix.

2.3 Climatic data

We load the non scale climatic data:

#climatic data 
climatic_data <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/Past_new_6_Climatic_data_scale_df.csv",sep=";",dec=",")

climatic_data_BAYPASS <- data.frame(t(climatic_data))

The units are
- Bio1: annual mean temperature (°C)
- Bio2: Mean diurnal range (mean of max temp - min temp)
- Bio4: Seasonality of temperature (standard deviation *100)
- Bio9: Mean temperature of driest quarter (°C)
- Bio12: Total (annual) rainfall (mm)
- Bio15: Precipitation seasonality (variation coefficient)

We need to format the data for BAYPASS, which can be done in two ways: either by creating a separate text file for each climate variable with populations in columns, or by creating a single text file containing all climate variables in rows and populations in columns.

write.table(x=climatic_data_BAYPASS[-c(1,2),],
  file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA_new_var/BAYPASS/climatic_data_BAYPASS.txt", 
            sep = " ",
            row.names = F, 
            col.names = F) 

climatic_data_BAYPASS contains the climatic data used to run the standard covariate model with IS in BAYPASS.

3 Covariate model

We ran the core model to estimate the covariance matrix used to correct for population structure. We used genomic data without MAC correction: 8,252 SNPs, 29 populations, with allele counts at the population level (data_allele_count_BAYPASS_MAF_no_c). The core model was run with the following parameters:
i) 20 pilot runs of 500 iterations (to adjust proposal distributions)
ii) a burn-in period of 5000 iterations
iii) final MCMC sampling of 1000 parameter values sampled every 20 iterations (i.e., a total of 20 000 iterations)

The covariance matrix used to correct for population structure is scaled directly within the model, so no additional scaling is required.

The output results are here:

omega <- as.matrix(read.table("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/covari_matrix_core_model/out_mat_omega.out"))
country_names <- climatic_data$Country
dimnames(omega)=list(country_names,country_names)

Now, we can visualize the covariance matrix using plots, corrplot, and heatmaps, as proposed by Gautier (BAYPASS Manual) and following Archambeau et al. (2024).

plot.omega <- function(omega, PC = c(1,2), country, main = expression(Omega*" matrix"), 
                       group_palette, pch = 16, expand_factor = 1.5) {
  om.svd <- svd(omega)
  eig <- om.svd$d
  pcent.var <- 100 * eig / sum(eig)

  # Assign colors based on country
  col <- group_palette[country]

  # Get min and max values for PC axes
  x_range <- range(om.svd$u[, PC[1]])
  y_range <- range(om.svd$u[, PC[2]])

  # Expand range symmetrically
  x_expand <- diff(x_range) * (expand_factor - 1) / 2
  y_expand <- diff(y_range) * (expand_factor - 1) / 2

  xlim <- x_range + c(-x_expand, x_expand)
  ylim <- y_range + c(-y_expand, y_expand)

  # Adjust margins for legend space
  par(mar = c(5, 5, 4, 6))  # Right margin adjusted for legend
  plot(om.svd$u[, PC], main = main, pch = pch, col = col, 
       xlim = xlim, ylim = ylim,  
       xlab = paste0("PC", PC[1], " (", round(pcent.var[PC[1]], 2), "%)"),
       ylab = paste0("PC", PC[2], " (", round(pcent.var[PC[2]], 2), "%)"),
       cex.lab = 1.5, cex.axis = 1.3, cex.main = 1.8  
  )
  # Fix legend position 
  legend("topright", inset = c(-0.15, 0), legend = names(group_palette), 
         col = group_palette, pch = 16, title = "Country", 
         cex = 1, bty = "n", xpd = TRUE, ncol = 2)  

  
}

# Increase plotting area 
dev.new(width = 10, height = 8)  

group_palette <- c("Bosnia"="orangered3", "France"="gold2","Germany"= "darkorchid3", 
                   "Greece"="navyblue", "Italy"="turquoise2", "Norway"="green3", 
                   "Slovakia"="blue", "Slovenia"="red", "Spain"="black", 
                   "Sweden"="gray", "Switzerland"="orange", "UK"="darkgreen")

#order country in the matrix
country_vector <- c("Italy", "Switzerland","Slovenia","Norway","Spain","Spain","Spain","UK","Italy","Italy","Slovenia","Slovakia","Switzerland","Switzerland","Greece","Greece","Greece","Germany","Spain","UK","France","Sweden","France","Spain","Slovenia","Italy","Bosnia","Switzerland","UK") # 

# Run the function
plot.omega(omega, PC = c(1,2), country = country_vector, group_palette = group_palette)
# as a correlation plot
cor_mat <- cov2cor(omega)
corrplot::corrplot(cor_mat)

# corrplot(cor_mat,method="color",mar=c(2,1,2,2)+0.1,
# main=expression("Correlation map based on"~hat(Omega)))

# as a heatmap and hierarchical clustering tree (using the average agglomeration method)
##we use the population names
pop_names <- climatic_data$Population
dimnames(omega)=list(pop_names,pop_names)

#cor matrix 
cor_mat <- cov2cor(omega)

hclust_ave <- function(x) hclust(x, method="average")
Heatmap_omega_matrix<-heatmap(1-cor_mat,hclustfun = hclust_ave,
main=expression("Heatmap of "~hat(Omega)~"("*d[ij]*"=1-"*rho[ij]*")"))

The plot of the omega (covariance) matrix closely resembles the population distribution obtained from a PCA using the first two axes. This indicates that the correction for population structure in BAYPASS is comparable to the corrections achieved with pRDA, LFMM, and Gradient Forest analyses.

4 standard covariate model

4.1 Run under Linux

We need to run the IS models multiple times to ensure that the outputs are consistent across runs. The models are executed using the IS method via the Linux command-line interface of BAYPASS. We will interpret the results using the Bayes factor (BF) and the empirical Bayesian p-values (eBPis).

4.2 Results

First, we extracted the BF and eBPis from the BAYPASS output (out_BAYPASS_outliers_ide_{seed}_summary_betai_reg.out)

#names snp
names_snps <- colnames(genomic_data_maf_c)

#name climatic data
  climatic_variables <- colnames(climatic_data)
  COVARIABLE <- c(1,2,3,4,5,6) #to merge with output results of BAYPASS and have the name of the climatic variables
climatic_variables_merge <- data.frame(climatic_variables[-c(1,2)], COVARIABLE)

#only the name of the variables 
names_climatic_variables <- climatic_variables[-c(1,2)]
for(x in 1:5){
  seed <- x+11#because the selected seeds of the runs are: 12 to 16
  output_BAYPASS_run <- read.table(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/BAYPASS/runs/out_BAYPASS_outliers_ide_",seed,"_summary_betai_reg.out"),h=T)
  
  #add the name of the climatic variables
  BAYPASS_clim <- merge(climatic_variables_merge,output_BAYPASS_run,"COVARIABLE")

BAYPASS_clim$climatic_variables..c.1..2.. <- as.factor(BAYPASS_clim$climatic_variables..c.1..2..)

#add names snps
subset_name <- paste0("RUN_",x, "_BAYPASS_results") # Name the output dataframe of the loop

  Data_results <- BAYPASS_clim %>% 
    group_by(climatic_variables..c.1..2..) %>% 
    mutate(names_snps = names_snps) %>%
    ungroup()
  
  assign(subset_name, Data_results)
}

names_subset_data_frame <- c("RUN_1_BAYPASS_results","RUN_2_BAYPASS_results","RUN_3_BAYPASS_results","RUN_4_BAYPASS_results","RUN_5_BAYPASS_results")

We also added the run info for BF and for EBpis.

## we also add the RUN info
    for (x in 1:5) {
      var <- paste0("RUN_",x,"_BAYPASS_results")
    # Create the new dataframe name
    names_final <- paste0(var, "_final")
    
    # Extract the dataframe using get
    data <- get(var)
    
    name_BF.dB. <- paste0("BF_RUN",x)
    nam_eBPis <- paste0("eBPis_RUN",x)
  
    # Mutate the data
    names(data)[names(data)== "BF.dB."] <- name_BF.dB.
    names(data)[names(data)== "eBPis"] <- nam_eBPis
    
    # Assign the mutated data to a new dataframe
    assign(names_final, data)
    }

Finally, we merged all the runs in on dataframe

#merge all runs
data_tot_results_allRUNs <- cbind(RUN_1_BAYPASS_results_final[,c(12,2,8,11)],RUN_2_BAYPASS_results_final[,c(8,11)],RUN_3_BAYPASS_results_final[,c(8,11)],RUN_4_BAYPASS_results_final[,c(8,11)],RUN_5_BAYPASS_results_final[,c(8,11)])

4.3 Results across runs

Next, we assessed whether the runs produced consistent results. To do this, we calculated the correlation of the Bayes factors (BF) and empirical Bayesian p-values (eBPis) across runs for each climatic variable.

First, we subset the dataset of all runs by climatic variable to create one dataset per variable.

#subset the results at the climatic variable scale to compare values between runs
for(var in names_climatic_variables){
  subset_name <- paste0(var, "_BAYPASS_all_runs") #name the output dataframe of the loop
  assign(subset_name, subset(data_tot_results_allRUNs, climatic_variables..c.1..2.. == var))#assign the name to the dataframe in a loop
}
#list of dataset with data of all run for each climatic data
final_names_dataframe <- c("Annual_Tc_BAYPASS_all_runs","Diurnal_range_Tc_BAYPASS_all_runs","Tc_Seasonality_BAYPASS_all_runs","Tc_driest_quarter_BAYPASS_all_runs","Annual_P_BAYPASS_all_runs","P_Seasonality_BAYPASS_all_runs")
#name climatic data
names_climatic_variables <- colnames(climatic_data[,-c(1,2)])

Then, we wanted to see if these values were consistent across runs by testing the correlation of the values across runs:

#correlation between runs for BF 
list_title <- c("BIO 1","BIO 2","BIO 4","BIO 9","BIO 12", "BIO 15")

for(x in 1:6){
  var <- final_names_dataframe[x]
  title_name <- list_title[x]
  #names of the corr matrix for each biovariable
  names_corr <- paste0("correlation_",var,"_BF")
  #title of corrplot
  title <- (paste0(title_name," all runs"))
  
  #group for each bio var with the values of Bayes factor only 
corr_bio <- cor(get(var)[, grepl("BF", names(get(var)))]) 

#name the corr_bio with names_corr
assign(names_corr,corr_bio)

#plot corrplot
corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 1, number.cex = 1, title = title,mar=c(0,0,1,0) )

#save
pdf(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/BAYPASS/figures/correlation_BF_values_across_runs_",names_climatic_variables[x],".pdf"));corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 1, number.cex = 1, title = title,mar=c(0,0,1,0) );dev.off()
}

Interpretation: Bayes factor (BF) values show high consistency across runs, with correlations for all climatic variables ≥ 0.84.

Conclusion: Candidate SNPs can be identified based on the mean BF across runs, indicating robust associations with the climatic variables.

for(var in final_names_dataframe){
  #names of the corr matrix for each biovariable
  names_corr <- paste0("correlation_",var,"_eBPis")
  #title of corrplot
  title <- (paste0(var,"_eBPis"))
  
#group for each bio var with the values of empirical bayesian pvalues only 
corr_bio <- cor(get(var)[, grepl("eBPis", names(get(var)))]) 

#name the corr_bio with names_corr
assign(names_corr,corr_bio)

#plot corrplot
#corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.6, number.cex = 0.6, title = title,mar=c(0,0,1,0) )

}

Interpretation: Empirical Bayesian p-values (eBPis) show high consistency across runs, with correlations for all climatic variables exceeding 0.80. Conclusion: We can also identify candidate SNPs based on the mean eBPis across runs.

5 Candidate detection

5.1 Thresholds

5.1.1 Bayes factor values

To select candidate SNPs, we will set a threshold based on the Bayes factor (BF) and empirical Bayesian p-values (eBPis). Following Gautier (2015), we will apply Jeffreys’ rule (Jeffreys, 1961) to quantify the strength of evidence in favour of the association between the SNP and the covariate, using the following dB unit scale:
- 10 < BF < 15 = strong evidence
- 15 < BF < 20 = very strong evidence
- BF > 20 = conclusive evidence

In addition, we considered eBPis, with values greater than 3 indicating candidates, as suggested by Ruiz Daniels et al. (2019). We also considered the top 1%, top 0.5% or top 100 SNPs with the highest correlation with the climatic variables.

To begin this selection process, we calculate the mean BF values across runs for each SNP.

final_names_dataframe <- c("Annual_Tc_BAYPASS_all_runs","Diurnal_range_Tc_BAYPASS_all_runs","Tc_Seasonality_BAYPASS_all_runs","Tc_driest_quarter_BAYPASS_all_runs","Annual_P_BAYPASS_all_runs","P_Seasonality_BAYPASS_all_runs")

#mean BF values 
for(data in final_names_dataframe){
  #names of the corr matrix for each biovariable
  names_corr <- paste0("mean_",data,"_BF_values")
  
  #select only the BF values
  dataset <- get(data)[, grepl("BF", names(get(data)))]
  
  #mean
  mean_BF_values <- data.frame(names_snps,rowMeans(dataset)); colnames(mean_BF_values)=c("names_snps","BF_values")
    
  
#name the corr_bio with names_corr
assign(names_corr,mean_BF_values)
}
list_mean_BF_clim <- c("mean_Annual_Tc_BAYPASS_all_runs_BF_values","mean_Diurnal_range_Tc_BAYPASS_all_runs_BF_values","mean_Tc_Seasonality_BAYPASS_all_runs_BF_values","mean_Tc_driest_quarter_BAYPASS_all_runs_BF_values","mean_Annual_P_BAYPASS_all_runs_BF_values","mean_P_Seasonality_BAYPASS_all_runs_BF_values")

Next, we will identify candidates by applying a Bayes Factor (BF) threshold of 10 for each climatic variable.

thres_BF <- 10 #threshold of BF

for(x in 1:6){
  #names of the corr matrix for each biovariable
  names_climatic_var <- names_climatic_variables[x]
  data <- get(list_mean_BF_clim[x])
  
 selection_outliers <- data.frame(Loci=data$names_snps[which(data$BF_values>thres_BF)],BF = data$BF_values[which(data$BF_values>thres_BF)], climatic_variables=names_climatic_var)
 
 assign(paste0("names_outliers_",names_climatic_var),selection_outliers)
 
 # Count the number of candidates for each climatic variable
  count <- data.frame(Climatic_variable=names_climatic_var,Number_outliers=nrow(selection_outliers))
  
  #name the corr_bio with names_corr
assign(paste0("outliers_",names_climatic_var),count)
}
  # Combine the results for each climatic variable to see the number of candidates
  outliers_summary_BF10 <- rbind(outliers_Annual_Tc,outliers_Diurnal_range_Tc,outliers_Tc_Seasonality,outliers_Tc_driest_quarter,outliers_Annual_P,outliers_P_Seasonality)


#name of the candidates
  outliers_names_summary_BF_10 <- rbind(names_outliers_Annual_Tc,names_outliers_Diurnal_range_Tc,names_outliers_Tc_Seasonality,names_outliers_Tc_driest_quarter,names_outliers_Annual_P,names_outliers_P_Seasonality)

  #finally we can search for same candidates across climatic variables
duplicated_loci_BF10 <- duplicated(outliers_names_summary_BF_10$Loci) | duplicated(outliers_names_summary_BF_10$Loci, fromLast = TRUE)

# Subset the data frame to show only the duplicated loci
duplicated_outliers_BF10 <- outliers_names_summary_BF_10[duplicated_loci_BF10, ]

#Number of candidates: 
nrow(outliers_names_summary_BF_10)
## [1] 67
#at the climatic var scale
outliers_summary_BF10
##   Climatic_variable Number_outliers
## 1         Annual_Tc               9
## 2  Diurnal_range_Tc               7
## 3    Tc_Seasonality              23
## 4 Tc_driest_quarter               8
## 5          Annual_P               3
## 6     P_Seasonality              17

We can see that 67 candidates are identified using BAYPASS:
- Bio 1: 9
- Bio 2: 7
- Bio 4: 23
- Bio 9: 8
- Bio 12: 3
- Bio 15: 17

In addition, 4 candidates are identified by 2 climatic variables.

Finally, we saved the set of candidates for downstream analysis

outliers_T_adapcon_gentree_BAYPASS_BF_10 <- outliers_names_summary_BF_10[,-2]

#write_xlsx(outliers_T_adapcon_gentree_BAYPASS_BF_10,"C:/Users/tfrancisco/Documents/Thèse/Results/species/taxus/GEA_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_10.xlsx")
save(outliers_T_adapcon_gentree_BAYPASS_BF_10, file="C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_10.Rdata")

We can also create a less conservative dataset with a threshold of BF>8:

thres_BF <- 8 #relax threshold

for(x in 1:6){
  #names of the corr matrix for each biovariable
  names_climatic_var <- names_climatic_variables[x]
  data <- get(list_mean_BF_clim[x])
  
 selection_outliers <- data.frame(Loci=data$names_snps[which(data$BF_values>thres_BF)],BF = data$BF_values[which(data$BF_values>thres_BF)], climatic_variables=names_climatic_var)
 
 assign(paste0("names_outliers_",names_climatic_var),selection_outliers)
 
 # Count the number of candidates for each climatic variable
  count <- data.frame(Climatic_variable=names_climatic_var,Number_outliers=nrow(selection_outliers))
  
  #name the corr_bio with names_corr
assign(paste0("outliers_",names_climatic_var),count)
}
  # Combine the results for each climatic variable to see the number of candidates
  outliers_summary_BF_8 <- rbind(outliers_Annual_Tc,outliers_Diurnal_range_Tc,outliers_Tc_Seasonality,outliers_Tc_driest_quarter,outliers_Annual_P,outliers_P_Seasonality)

#name of the candidates
  outliers_names_summary_BF8 <- rbind(names_outliers_Annual_Tc,names_outliers_Diurnal_range_Tc,names_outliers_Tc_Seasonality,names_outliers_Tc_driest_quarter,names_outliers_Annual_P,names_outliers_P_Seasonality)
  
#finally we can search for same candidates across climatic variables

duplicated_loci_BF_8 <- duplicated(outliers_names_summary_BF8$Loci) | duplicated(outliers_names_summary_BF8$Loci, fromLast = TRUE)

# Subset the data frame to show only the duplicated loci
duplicated_outliers_BF_8 <- outliers_names_summary_BF8[duplicated_loci_BF_8, ]

#Number of candidates: 
outliers_summary_BF_8
##   Climatic_variable Number_outliers
## 1         Annual_Tc              13
## 2  Diurnal_range_Tc              17
## 3    Tc_Seasonality              35
## 4 Tc_driest_quarter              20
## 5          Annual_P              13
## 6     P_Seasonality              35
#at the climatic var scale
nrow(outliers_names_summary_BF8)
## [1] 133

We can see that 133 candidates are identified using BAYPASS:
- Bio 1: 13
- Bio 2: 17 - Bio 4: 35
- Bio 9: 20
- Bio 12: 13
- Bio 15: 35

In addition 8 candidates are identified by 2 climatic variables.

Finally, we saved the set of candidates for downstream analysis

outliers_T_adapcon_gentree_BAYPASS_BF_8 <- outliers_names_summary_BF8[,-2]

#write_xlsx(outliers_T_adapcon_gentree_BAYPASS_BF_8,"C:/Users/tfrancisco/Documents/Thèse/Results/species/taxus/GEA/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_8.xlsx")
save(outliers_T_adapcon_gentree_BAYPASS_BF_8, file="C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEa_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_8.Rdata")

5.1.2 Overlapping SNPs across 5 runs

Another method of identifying outliers is to look for overlapping SNPs across 5 runs, similar to the approach used in GF, rather than relying on the Bayes factor (BF) means.

list_clim <- c("Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")
list_runs <- c("RUN_1_BAYPASS_results","RUN_2_BAYPASS_results","RUN_3_BAYPASS_results","RUN_4_BAYPASS_results","RUN_5_BAYPASS_results")

for(x in 1:length(list_runs)){

  data <- get(list_runs[x])

for(i in 1:length(list_clim)){
  
  clim_name <- list_clim[i]
  
  data_clim <- data %>% filter(climatic_variables..c.1..2.. == clim_name)
   
 #BF values > 8
outliers_BF_8 <- data_clim[,c(12,2,8)] %>% filter(BF.dB.> 8) %>% pull(names_snps) 

 assign(paste0("Run",x,"_outliers_BF_8_",clim_name),outliers_BF_8)

 #BF values > 10
outliers_BF_10 <- data_clim[,c(12,2,8)] %>% filter(BF.dB.> 10) %>% pull(names_snps) 

 assign(paste0("Run",x,"_outliers_BF_10_",clim_name),outliers_BF_10)
 
  }
}
list_threshold <- c(8,10)

for(i in 1:length(list_threshold)){
  
threshold <- list_threshold[i]

for(x in 1:length(list_clim)){
  
  clim_var <- list_clim[x]
  data1 <- get(paste0("Run",1,"_outliers_BF_",threshold,"_",clim_var))
  data2 <- get(paste0("Run",2,"_outliers_BF_",threshold,"_",clim_var))
  data3 <- get(paste0("Run",3,"_outliers_BF_",threshold,"_",clim_var))
  data4 <- get(paste0("Run",4,"_outliers_BF_",threshold,"_",clim_var))
  data5 <- get(paste0("Run",5,"_outliers_BF_",threshold,"_",clim_var))
  
  #Select only the candidates identified in all 5 runs
outliers_set <- Reduce(intersect, list(data1,data2,data3,data4,data5))

  assign(paste0("outliers_",clim_var,"_",threshold),outliers_set)  
  }
}

See the final set of outliers

list_threshold <- c(8,10) # Replace with your actual threshold values
clim_vars <- c("Annual_Tc", "Diurnal_range_Tc", "Tc_Seasonality", "Tc_driest_quarter", "Annual_P", "P_Seasonality")

df_tot_outliers_list <- list()

for (x in 1:length(list_threshold)) {
  threshold <- list_threshold[x]
  combined_data <- data.frame()

  for (var in clim_vars) {
    data <- data.frame(set = get(paste0("outliers_", var, "_", threshold)))
    data$var <- var
    combined_data <- rbind(combined_data, data)
  }

  unique_data <- combined_data[!duplicated(combined_data$set),]  # Ensuring unique rows
  df_tot_outliers_list[[threshold]] <- unique_data

  assign(paste0("df_tot_outliers_", threshold), unique_data)
}
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31 ucrt)
##  os       Windows 11 x64 (build 26100)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  French_France.utf8
##  ctype    French_France.utf8
##  tz       Europe/Paris
##  date     2025-09-22
##  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date (UTC) lib source
##  abind          1.4-8      2024-09-12 [1] CRAN (R 4.3.3)
##  backports      1.4.1      2021-12-13 [1] CRAN (R 4.3.1)
##  base64enc      0.1-3      2015-07-28 [1] CRAN (R 4.3.1)
##  broom          1.0.8      2025-03-28 [1] CRAN (R 4.3.3)
##  bslib          0.6.1      2023-11-28 [1] CRAN (R 4.3.2)
##  cachem         1.0.8      2023-05-01 [1] CRAN (R 4.3.1)
##  car            3.1-2      2023-03-30 [1] CRAN (R 4.3.2)
##  carData        3.0-5      2022-01-06 [1] CRAN (R 4.3.2)
##  cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.3.2)
##  cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.1)
##  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.1)
##  corrplot     * 0.92       2021-11-18 [1] CRAN (R 4.3.2)
##  curl           5.2.0      2023-12-08 [1] CRAN (R 4.3.2)
##  data.table   * 1.15.0     2024-01-30 [1] CRAN (R 4.3.2)
##  devtools       2.4.5      2022-10-11 [1] CRAN (R 4.3.3)
##  digest         0.6.33     2023-07-07 [1] CRAN (R 4.3.1)
##  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
##  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.3.1)
##  evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.2)
##  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.1)
##  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.1)
##  geigen       * 2.3        2019-05-30 [1] CRAN (R 4.3.1)
##  generics       0.1.4      2025-05-09 [1] CRAN (R 4.3.2)
##  ggplot2      * 3.5.2      2025-04-09 [1] CRAN (R 4.3.2)
##  glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.2)
##  gtable         0.3.6      2024-10-25 [1] CRAN (R 4.3.3)
##  here         * 1.0.1      2020-12-13 [1] CRAN (R 4.3.3)
##  highr          0.10       2022-12-22 [1] CRAN (R 4.3.1)
##  hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.2)
##  htmltools      0.5.7      2023-11-03 [1] CRAN (R 4.3.2)
##  htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.3.2)
##  httpuv         1.6.13     2023-12-06 [1] CRAN (R 4.3.2)
##  httr           1.4.7      2023-08-15 [1] CRAN (R 4.3.2)
##  import         1.3.2      2024-01-21 [1] CRAN (R 4.3.3)
##  jquerylib      0.1.4      2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite       1.8.8      2023-12-04 [1] CRAN (R 4.3.2)
##  knitr          1.45       2023-10-30 [1] CRAN (R 4.3.2)
##  later          1.3.2      2023-12-06 [1] CRAN (R 4.3.2)
##  lattice        0.22-5     2023-10-24 [2] CRAN (R 4.3.2)
##  lazyeval       0.2.2      2019-03-15 [1] CRAN (R 4.3.2)
##  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.2)
##  lubridate    * 1.9.3      2023-09-27 [1] CRAN (R 4.3.2)
##  magrittr     * 2.0.3      2022-03-30 [1] CRAN (R 4.3.1)
##  markdown       1.12       2023-12-06 [1] CRAN (R 4.3.2)
##  MASS           7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.2)
##  memoise        2.0.1      2021-11-26 [1] CRAN (R 4.3.1)
##  mime           0.12       2021-09-28 [1] CRAN (R 4.3.1)
##  miniUI         0.1.1.1    2018-05-18 [1] CRAN (R 4.3.2)
##  mnormt         2.1.1      2022-09-26 [1] CRAN (R 4.3.1)
##  munsell        0.5.1      2024-04-01 [1] CRAN (R 4.3.3)
##  mvtnorm      * 1.3-1      2024-09-03 [1] CRAN (R 4.3.3)
##  nlme           3.1-164    2023-11-27 [2] CRAN (R 4.3.2)
##  patchwork      1.2.0      2024-01-08 [1] CRAN (R 4.3.2)
##  pillar         1.10.2     2025-04-05 [1] CRAN (R 4.3.3)
##  pkgbuild       1.4.3      2023-12-10 [1] CRAN (R 4.3.2)
##  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.2)
##  pkgload        1.3.4      2024-01-16 [1] CRAN (R 4.3.2)
##  plotly         4.10.4     2024-01-13 [1] CRAN (R 4.3.2)
##  png            0.1-8      2022-11-29 [1] CRAN (R 4.3.1)
##  profvis        0.3.8      2023-05-02 [1] CRAN (R 4.3.2)
##  promises       1.2.1      2023-08-10 [1] CRAN (R 4.3.2)
##  psych          2.4.1      2024-01-18 [1] CRAN (R 4.3.2)
##  purrr          1.0.2      2023-08-10 [1] CRAN (R 4.3.2)
##  R6             2.6.1      2025-02-15 [1] CRAN (R 4.3.3)
##  radiant.data * 1.6.3      2023-12-17 [1] CRAN (R 4.3.3)
##  randomizr      1.0.0      2023-08-10 [1] CRAN (R 4.3.3)
##  Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.1)
##  readr          2.1.5      2024-01-10 [1] CRAN (R 4.3.2)
##  readxl         1.4.3      2023-07-06 [1] CRAN (R 4.3.2)
##  remotes        2.5.0      2024-03-17 [1] CRAN (R 4.3.3)
##  rlang          1.1.3      2024-01-10 [1] CRAN (R 4.3.2)
##  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.1)
##  rprojroot      2.0.4      2023-11-05 [1] CRAN (R 4.3.2)
##  rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.2)
##  sass           0.4.8      2023-12-06 [1] CRAN (R 4.3.2)
##  scales         1.3.0      2023-11-28 [1] CRAN (R 4.3.2)
##  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
##  shiny          1.8.0      2023-11-17 [1] CRAN (R 4.3.2)
##  shinyAce       0.4.2      2022-05-06 [1] CRAN (R 4.3.3)
##  shinyFiles     0.9.3      2022-08-19 [1] CRAN (R 4.3.3)
##  stringi        1.8.3      2023-12-11 [1] CRAN (R 4.3.2)
##  stringr      * 1.5.1      2023-11-14 [1] CRAN (R 4.3.2)
##  tibble         3.2.1      2023-03-20 [1] CRAN (R 4.3.2)
##  tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.3.2)
##  tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.3.3)
##  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.2)
##  tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.2)
##  urlchecker     1.0.1      2021-11-30 [1] CRAN (R 4.3.2)
##  usethis        2.2.3      2024-02-19 [1] CRAN (R 4.3.2)
##  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
##  viridisLite    0.4.2      2023-05-02 [1] CRAN (R 4.3.2)
##  withr          3.0.2      2024-10-28 [1] CRAN (R 4.3.3)
##  writexl      * 1.5.0      2024-02-09 [1] CRAN (R 4.3.2)
##  xfun           0.41       2023-11-01 [1] CRAN (R 4.3.2)
##  xtable         1.8-4      2019-04-21 [1] CRAN (R 4.3.2)
##  yaml           2.3.8      2023-12-11 [1] CRAN (R 4.3.2)
## 
##  [1] C:/Users/tfrancisco/AppData/Local/R/win-library/4.3
##  [2] C:/Users/tfrancisco/AppData/Local/Programs/R/R-4.3.2/library
## 
## ──────────────────────────────────────────────────────────────────────────────