1 Introduction

This script processes climatic and environmental data from the raw output of ClimateDT (or any program providing raster TIFF data) into a dataframe containing standardized values for the selected variables under past, present, and future climatic conditions. High-resolution climatic data (30 arc-seconds) were obtained from the Climate Downscaling Tool (ClimateDT; Marchi et al., 2024), a dynamic web tool that provides elevation-adjusted historical and future climatic data. The period 1901–1950 was used as the reference for local adaptation of populations, and the period 2041–2060 was used as the future scenario, with mean values derived from five global climate models (GCMs) under the socio-economic pathway SSP3-7.0.

To identify the final set of climatic predictors for downstream analyses, we first focused on a pre-selected set of variables relevant to the species’ ecology. For past climatic data, forward selection was performed using the ordiR2step function from the vegan R package (Oksanen et al., 2001) to identify variables most closely related to genomic variation across populations. Next, variables with high collinearity (correlation coefficient > 0.75) were removed, prioritizing retention of the most informative variables identified during forward selection. Finally, variance inflation factors (VIFs) were calculated, and additional variables were removed to ensure VIF values were below 10.

After these selection steps, the retained climatic variables were standardized, and the processed datasets for past, present, and future scenarios were exported in multiple formats for downstream analyses.

meta_data_pop <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Populations/taxus_sample_29pop.csv",h=T,sep=";",dec=",")
meta_data_vcf <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/samples/samples_taxus_baccata_adapcon_gentree.csv",h=T,sep=";",dec=",")

2 Pre-selection of climatic variables

he first step involved reviewing the literature on the target species to define a set of climatic variables that could potentially drive adaptive genetic variation across its range. The aim was to retain only biologically informative variables, reducing the number of predictors tested and allowing refinement of hypotheses prior to analysis. The goal was not to limit the dataset to only 2–3 variables, nor to retain all available variables, but to pre-select a moderate set—typically 10–15 variables depending on the species and the depth of relevant literature.

For T. baccata, the initial pre-selection included the following variables: Bio 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 18, MGSP, AHM, SHM.

3 Climatic data

3.1 Data

Climatic data were obtained from ClimateDT in two formats: point layers and rasters, for both past and future conditions. A key difference between past and future data lies in their temporal resolution:

Past data provide one value per year for each variable at each population/pixel. For example, for 1901–1950, there are 50 values per variable per population.

Future data are less detailed, providing only mean values for the selected period; for example, 2041–2070 has one value per variable per population.

Many bioclimatic variables (Bio1–Bio19) are derived from monthly minimum temperature (tmin), maximum temperature (tmax), and precipitation. Therefore, the average of yearly Bio1 values cannot be directly calculated for a period. Instead, tmin, tmax, and precipitation are averaged for each month across the period (e.g., one tmin value for January, one for February, etc.). These monthly averages are then used with the biovars function from the Dismo R package to calculate the 19 bioclimatic variables for the period of interest (e.g., 1901–1950). Other derived variables, such as AHM, can be calculated following the ClimateDT procedures using Bio1, Bio10, and Bio12 values from biovars.

For variables not directly linked to tmin, tmax, precipitation, or bioclimatic variables (e.g., GDD5), obtaining period-specific averages is less straightforward, thus these data were provided directly by Maurizio Marchi (IBBR-CNR).

It is important to note that future climatic predictions can vary depending on the model used. To avoid bias from extreme model predictions, mean values from multiple models can be used; however, accessing multiple models is only possible via raster data.

Based on previous testing with the script Processing_climatic_data_T_adapcon_Gentree_bis, differences between point layers and raster data were significant for temperature-related variables. Therefore, we used raster data for the 1901–1950 period to maintain consistency with future raster-based predictions and avoid discrepancies due to the data extraction method

Load past climatic data

#Create an R object where all the climatic rasters are.
ras.bio <- stack(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/", pattern = ".tif$", full.names = T)) 
#Names of the raster (because they are not in the classic order Bio 1, 2 etc they are order by number so bio 1, 10, 11 etc)
names(ras.bio) <- unlist(strsplit(unlist(lapply(strsplit(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/", pattern = ".tif$", full.names = T), split = "C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/"), function(x) x[2])), split = ".tif"))

#We extracted the climatic values for each populations from the raster based on their coordinates
coords<- data.frame(apply(meta_data_pop[,c(5:4)], 2, as.numeric))#We need to have longitude then latitude
past_climatic_data_raster <- data.frame(meta_data_pop$Population,raster::extract(ras.bio, coords))#Use raster::function_name to avoid confusion with tidyr functions that have the same name
colnames(past_climatic_data_raster)<- c("Population",names(ras.bio))

#Add country info 
past_climatic_data_f <- merge(meta_data_pop[,c(1,2)],past_climatic_data_raster, "Population")

3.2 Climatic variations among populations

#Do the violin plot in a loop for each climatic variables

#Input data for the loop
data <- past_climatic_data_f
regions <-past_climatic_data_f$Country
bioclim_vars <- c("bio1", "bio2", "bio3", "bio4","bio5","bio6","bio7","bio9", "bio10","bio12","bio13","bio14","bio15","bio16", "bio17", "bio18","AHM", "MGSP", "SHM", "GDD5")
unit_var <- c("°c","°c","index","°c","°c","°c","°c","°c","°c","mm","mm","mm","index","mm","mm","mm","°c/mm","mm","°c/mm","°c*days" )

#Results: the loop
for (x  in 1:length(bioclim_vars)) {
  var <- bioclim_vars[x]
   unit <- unit_var[x]
  data_subset <- data.frame(data[, paste0(var)]) #For each variable in bioclim_vars, we extract the two periods and created a data.frame with only these two variables. 
  colnames(data_subset) <- c("var1")
  #This create scatterplot to compare the bioclimatic variable between the two periods
  plot <- ggplot(data_subset, aes(x =var,y= var1)) +#War is the values and var1 is just the name of the variable 
  geom_violin(trim = FALSE) +
  geom_point(aes(color = regions), shape = 16, position = position_jitter(seed = 1, width = 0.2)) +
  labs(colour="Countries",x="variable",y=paste0("values ","(",unit,")"),title=paste0("Violin Plot of ",var," differences across populations"))+
  theme(plot.title = element_text(hjust = 0.5))
  
  #Save
png(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/past_climatic_data/violin_plot/violin_plot_",var,".png"));print(plot)
;dev.off()
}

Example of violin plot:

Interpretation: We observed differences in climatic values across populations, suggesting that local adaptation could arise between populations, as distinct climatic selective pressures appear to affect different populations.

4 Selection of climatic data

The selection of climatic variables was considered through three more steps:

4.1 Imprecision along studied area

The accuracy of climatic variables is not uniform across the study area or among variables themselves. If uncertainty layers were available, variables with high inaccuracy could be removed. However, this assessment was not feasible for the present study.

4.2 Explained variance

To identify which climatic variables explain the most genetic variation, we conducted a Redundancy Analysis (RDA; see RDA_candidate_detection script for details). Using the ordiR2step function from the vegan R package, we started with a null model containing no explanatory variables and progressed to a full model including all potential explanatory variables. Variables were retained based on their significance in increasing model R², using a p-value threshold of < 0.05 over 1,000 permutations. The most important variables explaining genomic variation were identified, and twenty independent runs were conducted to assess the consistency of variable detection.

#Genomic data
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/data_allelic_frequencies_29pop_adapcon_gentree_475_8616.Rdata")
genomic_matrix <- data_allelic_frequencies_29pop_adapcon_gentree_475_8616

Climatic variables were scaled to standardise their range and variance prior to downstream analyses

past_climatic_data_scale <- past_climatic_data_f[,-c(1,2)]%>%
  apply(2,as.numeric) %>% as.data.frame %>% 
mutate(across(where(is.numeric), scale))

We can perform the RDA:

#Null model
RDA_null <- rda(genomic_matrix~1,past_climatic_data_scale )

#Model with all the variables
RDA_full <- rda(
formula=genomic_matrix~bio1+bio2+bio3+bio4+bio5+bio6+bio7+bio9+bio10+bio12+bio13+bio14+bio15+bio16+bio17+bio18+AHM+MGSP+SHM+GDD5,data = past_climatic_data_scale,scale=F)

#Stepwise by selecting only variables that increase the adjusted R^2 of the model and that have a pvalue lower than 0.05
Variable_selection_pv0.05 <- ordiR2step(RDA_null, RDA_full, Pin = 0.05, R2permutations = 1000, R2scope = T)

#Variable_selection_pv0.05$anova
names(Variable_selection_pv0.05$CCA$envcentre)

#we want, following Archambeau et al 2024, investigate if the permutation stepwise identified the same variables of interest depending on the run:
#we run 20 independent stepwise models.
nbmodels <- 20

# Stepwise selection with ordiR2step function
#rep20_variable_selec_pv0.05 <- lapply(1:nbmodels, function(x) {
  #mod <- ordiR2step(RDA_null, RDA_full, Pin = 0.05, R2permutations = 1000, R2scope = T)
  #return(names(mod$CCA$envcentre))}) %>% 
  #setNames(paste0("model",1:nbmodels)) %>% 
  #ldply(function(x) data.frame(variables=x),.id="models") %>%  
  #dplyr::summarise(count(variables)) %>% 
  #setNames(c("variable","count"))

rep20_variable_selec_pv0.05

##   variable count
## 1     bio2    20
## 2     bio4    20
## 3     bio7    20
## 4     bio9    20

Interpretation: Bio2, Bio4, Bio7, and Bio9 appear to be important climatic variables. They were consistently identified in all 20 runs of the ordiR2step analysis, each contributing to an increase in model R² with p-values < 0.05.

4.3 Multicollinearity Between Variables

To reduce multicollinearity, only variables with low pairwise correlation were retained. Specifically, variables with an absolute correlation coefficient < 0.75 were kept from the pre-selected set of climatic variables.

#Function to do it
#Matrix of correlation
correlation_function <-function(data,threshold){
    data_correlation <- subset(data,select= -c(Population,Country)) 
      rownames(data_correlation) <- data$Populations
      correlation <- cor(data_correlation)
  correlation[abs(correlation) <= threshold] <- 0
corr_plot <- corrplot(correlation, method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.6, number.cex = 0.6)
#Save
png("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/selection_var/corrplot_multicollinearity.png");corr_plot <- corrplot(correlation, method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.6, number.cex = 0.6);dev.off()

#Plot the corrplot
corr_plot
}

#Right order
past_climatic_data_order <- past_climatic_data_f[,c(1,2,4,13:19,5:12,3,20,21,22)]

#Correlation past/present
data_present <- past_climatic_data_order
threshold <- 0.75
correlation_past <-correlation_function(data_present,threshold)

## Warning in ind1:ind2: numerical expression has 2 elements: only the first used

## Warning in ind1:ind2: numerical expression has 2 elements: only the first used

Conclusion: Based on the correlation among variables and prior identification of the most important predictors of genetic variation, we selected BIO1, BIO2, BIO3, BIO4, BIO5, BIO9, BIO12, and BIO15.

4.4 Variance inflation factor (VIF)

To further reduce multicollinearity, we calculated the variance inflation factor (VIF) for the retained variables, ensuring that no highly correlated variables were included in the final dataset.

#Subset previous step
RDA_subset_step4 <- rda(formula=genomic_matrix~bio1+bio2+bio3+bio4+bio5+bio9+bio12+bio15,data = past_climatic_data_scale,scale=F)
#anova.cca(RDA_subset_step4)
#RsquareAdj(RDA_subset_step4)
##Vif
vif.cca(RDA_subset_step4)

##       bio1       bio2       bio3       bio4       bio5       bio9      bio12 
##  28.981154 166.091110  64.987887  65.456002  46.882875   3.491931   1.427049 
##      bio15 
##   1.769682

We observed that the VIF for some variables in the subset from previous step was excessively high (e.g., over 160 for BIO2). Considering this and the importance of certain variables for explaining genetic variation, we adjusted the subset to ensure that all VIF values were below 10.

#Adjusted subset based on VIF
RDA_adjusted_subset <- rda(formula=genomic_matrix~bio1+bio2+bio4+bio9+bio12+bio15,data = past_climatic_data_scale,scale=F)
#anova.cca(RDA_adjusted_subset)
#RsquareAdj(RDA_adjusted_subset)
##Vif
vif.cca(RDA_adjusted_subset)

##     bio1     bio2     bio4     bio9    bio12    bio15 
## 1.861089 2.470835 2.780922 2.708000 1.357544 1.768915

Interpretation: By removing BIO5 and BIO3, we successfully reduced the VIF for all retained variables to below 10 (even below 3).

Conclusion: we retained the subset: Bio 1, 2, 4, 9, 12, 15 : - Bio 1: Mean annual Temperature (°C) - Bio 2: Mean Diurnal Range (Mean of monthly (max temp - min temp)) (°C) - Bio 4: Temperature Seasonality (standard deviation x100) (°C) - Bio 9: Mean Temperature of Driest Quarter (°C) - Bio 12: Mean annual Precipitation (mm) - Bio 15: Precipitation Seasonality (mm)

5 Retained climatic data other periods

Based on the climatic variable selection, the following variables were retained for present and future periods: Bio 1, 2, 4, 9, 12, 15

5.1 Present

#Create an R object where all the climatic rasters are.
ras.bio_present <- stack(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1991-2020_raster_climateDT/", pattern = ".tif$", full.names = T)) 
#Names of the raster (because they are not in the classic order Bio 1, 2 etc they are order by number so bio 1, 10, 11 etc)
names(ras.bio_present) <- unlist(strsplit(unlist(lapply(strsplit(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1991-2020_raster_climateDT/", pattern = ".tif$", full.names = T), split = "C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1991-2020_raster_climateDT/"), function(x) x[2])), split = ".tif"))

#We extracted the climatic values for each populations from the raster based on their coordinates
coords<- data.frame(apply(meta_data_pop[,c(5:4)], 2, as.numeric))#We need to have longitude then latitude
present_climatic_data_Adapon_pop <- data.frame(meta_data_pop$Population,raster::extract(ras.bio_present, coords))#Important to add raster:: because this function is also in tidyr and will not do the same things
colnames(present_climatic_data_Adapon_pop)<- c("Population",names(ras.bio_present))

#Add country info 
present_climatic_data_Adapon_pop_f <- merge(meta_data_pop[,c(1,2)],present_climatic_data_Adapon_pop, "Population")

#Order
Present_new_6_Climatic_data_scale <-present_climatic_data_Adapon_pop_f[,c(1,2,3,6:8,4,5)]

#Name bio 
colnames(Present_new_6_Climatic_data_scale) <- c("Population","Country","Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")

##Future Climatic Data

Multiple climatic models are available for future projections. Selecting a single model without prior information could lead to bias, as predictions can differ substantially among models (Archambeau et al. 2024). To reduce this risk, we selected five widely used models for the 2041–2070 period under the SSP3-7.0 scenario:
- GFDL-ESM4
- IPSL-CM6A-LR
- MPI-ESM1-2-HR - MPI-ESM2-0
- UKESM1-0-LL

The first step involved loading the future climatic data for each of the selected climatic models.

#List of the model for the loop
list_models<- c("GFDL_ESM4","IPSL_CM6A_LR","MPI_ESM1_2_HR","MRI_ESM2_0","UKESM1_0_LL") 

for(x in 1:length(list_models)){
  
  model <- list_models[x]
#Create an R object where all the climatic rasters are
ras.bio_future <- stack(list.files(paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/"), pattern = ".tif$", full.names = T)) 
#Names of the raster (because they are not in the classic order Bio 1, 2 etc they are order by number so bio 1, 10, 11 etc)
names(ras.bio_future) <- unlist(strsplit(unlist(lapply(strsplit(list.files(paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/"), pattern = ".tif$", full.names = T), split = paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/")), function(x) x[2])), split = ".tif"))

future_climatic_data <- data.frame(meta_data_pop$Population,raster::extract(ras.bio_future, coords))
colnames(future_climatic_data)=c("Population",names(ras.bio_future))
new_colnames<- paste(colnames(future_climatic_data[,-1]),model, sep = "_")
colnames(future_climatic_data)[-1] <- new_colnames

#Add country info 
future_climatic_data_f <- merge(meta_data_pop[,c(1,2)],future_climatic_data, "Population")
assign(paste0("Future_data_",model), future_climatic_data_f)
}

The next step involved grouping the data by climatic variable across all selected climatic models.

#List of the climatic var
list_climatic_var <- c("bio1","bio12","bio15","bio2","bio4","bio9")

for(x in 1:length(list_climatic_var)){
  climatic_var <- list_climatic_var[x]
  
  pos_clim <- x+2
  
#Climatic_values_ff <- Future_data_GFDL_ESM4[,x]
  
  data_set_clim <- data.frame(Population=Future_data_GFDL_ESM4$Population,Country=Future_data_GFDL_ESM4$Country,Future_data_GFDL_ESM4[,pos_clim],Future_data_IPSL_CM6A_LR[,pos_clim],Future_data_MPI_ESM1_2_HR[,pos_clim],Future_data_MRI_ESM2_0[,pos_clim],Future_data_UKESM1_0_LL[,pos_clim])
  
  colnames(data_set_clim)<- c("Population","Country",paste0(climatic_var,"_GFDL_ESM4"),paste0(climatic_var,"_IPSL_CM6A_LR"),paste0(climatic_var,"_MPI_ESM1_2_HR"),paste0(climatic_var,"_MRI_ESM2_0"),paste0(climatic_var,"_UKESM1_0_LL"))
  
  assign(paste0(climatic_var,"_future_climatic_data"),data_set_clim)
}

Next, we generated violin plots of the climatic values for each population across the different climatic models. One plot was created for each bioclimatic variable of interest.

#Input data for the loop
unit_var <- c("°c","mm","Coefficient of Variation","°c","sd°c x100","°c")
list_dataset <- c("bio1_future_climatic_data","bio12_future_climatic_data","bio15_future_climatic_data","bio2_future_climatic_data","bio4_future_climatic_data","bio9_future_climatic_data")

#Results: the loop
for (x  in 1:length(list_dataset)) {
  
  dataset <- get(list_dataset[x])
   unit <- unit_var[x]
var_name <- list_climatic_var[x]
  
data_violin <- gather(dataset,key="dataset",value="Value",-Population,-Country)

#Add the name of the model 
data_violin_f <- data_violin %>%
  mutate(models = gsub("^[^_]+_", "", dataset))

#Create scatterplot to compare the bioclimatic variable between the two periods
  plot <- ggplot(data_violin_f, aes(x = models,y= Value)) +
  geom_violin(trim = FALSE) +
  geom_point(aes(color = data_violin_f$Country), shape = 16, position = position_jitter(seed = 1, width = 0.2),size=2) +
  labs(colour="Country",x=var_name,y=paste0("values ","(",unit,")"),title=paste0("Violin Plot of ",var_name," differences across climatic models"))+
  theme_classic() +  #Removes the grey background
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 16),
    axis.title.y = element_text(size = 16),
    axis.text.x = element_text(size = 8.3),
    axis.text.y = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 12),
    legend.key = element_rect(fill = "transparent", color = NA),
    panel.grid.major = element_line(color = "grey"),  
    panel.grid.minor = element_line(color = "grey"), 
    panel.grid.major.x = element_line(color = "grey"), 
    panel.grid.major.y = element_line(color = "grey")
  ) +
     scale_x_discrete(labels = c("label1" = "name1", "label2" = "name2")) +
    guides(color = guide_legend(override.aes = list(size = 4))) #Increase legend point size
  
#Save
pdf(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/future_climatic_data/violin_plot_comp_models/Comparison_",var_name,"between_climatic_models.pdf"));print(plot)
;dev.off()

#Print the plot
  print(plot)
}

Although some differences between climatic models are visible, the values are generally similar. To quantify this similarity, we calculated a correlation matrix among the models.

list_names <- c("Bio 1","Bio 12","Bio 15","Bio 2","Bio 4","Bio 9")
for(x in 1:length(list_dataset)){
  df <- get(list_dataset[x])
  
  df_num <- df[,-c(1,2)] %>% mutate(across(as.numeric))
  
  #Names of the corr matrix for each biovariable
  names_corr <- paste0("correlation_",list_dataset[x])
  #Title of corrplot
  title <- (paste0("Correlation ",list_dataset[x]))
  
  name_title <- list_names[x]
  
  #Group for each bio var with the values of Bayes factor only 
corr_bio <- cor(df[, grepl("bio", names(df))]) 

#Name the corr_bio with names_corr
assign(names_corr,corr_bio)

names_corr
#Plot corrplot
corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.8, number.cex = 0.7,mar=c(0,0,1,0),title=paste0("Correlation ",name_title," future climatic data"))

#Save
pdf(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/future_climatic_data/Correlation/correlation_",list_dataset[x],".pdf"));corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.8, number.cex = 0.7,mar=c(0,0,1,0), title=paste0("Correlation ",name_title," future climatic data"));dev.off()
}

Climatic values were generally similar across models, except for BIO9, where the UKESM1-0-LL model predicted values that were inconsistent with the other models. This highlights the importance of caution when selecting future climatic values in the absence of prior information favoring one model. For subsequent analyses, we used all five climatic models to calculate genomic offset (GO) and compare predictions.

6 Standardisation of the retained climatic variables

The final step involved creating a data frame with the retained climatic variables standardised for downstream analyses. Present and future climatic values were standardised using the past data as a reference, allowing comparisons of values to build indices such as the genomic discrepancy index or Genomic Offset between periods.

6.1 Past data

#Dataframe not scaled
Past_clim_data_Europe_pop <- past_climatic_data_order[,c(1:4,6,10,12,15)]
#Creation of the scaled matrix
Past_new_6_Climatic_data_scale <-past_climatic_data_order[,c(3,4,6,10,12,15)] %>% 
    scale()

Past_new_6_Climatic_data_scale_df <- data.frame(past_climatic_data_order[,c(1,2)],Past_new_6_Climatic_data_scale)

colnames(Past_new_6_Climatic_data_scale_df) <- c("Population","Country","Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")

df_to_scale <- past_climatic_data_order[,c(3,4,6,10,12,15)]
colnames(df_to_scale) <- c("Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")

  scale_env_value_new_cli <- attr(scale(df_to_scale), 'scaled:scale')
center_env_value_new_cli <- attr(scale(df_to_scale), 'scaled:center')

6.2 Present

#Scale 
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/scale_env_value_new_cli.Rdata")
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/center_env_value_new_cli.Rdata")

Present_climatic_data_T_adapcon_gentree_scaled <- data.frame(Present_new_6_Climatic_data_scale[,c(1,2)],scale(Present_new_6_Climatic_data_scale[,-c(1,2)], center = center_env_value_new_cli, scale = scale_env_value_new_cli))

The raster containing the standardised climatic variables was saved for use in the RDA-based Genomic Offset analysis

6.3 Future data

6.3.1 Data frame format

Finally, future climatic data were scaled using the mean and standard deviation of the past climatic data.

for(x in 1:length(list_models)){
  
  model <- list_models[x]
  data <- get(paste0("Future_data_",model))
  data_final <- data.frame(data[,c(1:3,6:8,4,5)])
  colnames(data_final) <- c("Population","Country","Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")

Future_climatic_data_scaled <- data.frame(data_final[,c(1,2)],scale(data_final[,-c(1,2)], center = center_env_value_new_cli, scale = scale_env_value_new_cli))

assign(paste0("future_climatic_data_",model),data_final)
assign(paste0("future_climatic_data_scaled_",model),Future_climatic_data_scaled)

#Non scaled
write_xlsx(get(paste0("future_climatic_data_", model)),paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/future_climatic_data_",model,".xlsx"))

#Save scale df
write_xlsx(get(paste0("future_climatic_data_scaled_",model)),paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/future_climatic_data_scaled_",model,".xlsx"))
save(list=paste0("future_climatic_data_scaled_",model),file=paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/future_climatic_data_scaled_",model,".Rdata"))
}

6.3.2 Raster format

Next, we loaded the folder containing raster files and created a raster stack for all climatic variables for each climatic model.

for(x in 1:length(list_models)){
  
  model <- list_models[x]

future_climatic_data_raster <- stack(list.files(paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/"), pattern = ".tif$", full.names = T)) 

future_climatic_data <- data.frame(meta_data_pop$Population,raster::extract(future_climatic_data_raster, coords))#Important to add raster:: 
#Names of the raster (because they are not in the classic order Bio 1, 2 etc they are order by number so bio 1, 10, 11 etc)
names(future_climatic_data_raster) <- unlist(strsplit(unlist(lapply(strsplit(list.files(paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/"), pattern = ".tif$", full.names = T), split = paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/Future_climate/2041_2070_",model,"_ssp370/")), function(x) x[2])), split = ".tif"))

assign(paste0("future_climatic_data_raster_",model),future_climatic_data_raster)

#Save file
save(list=paste0("future_climatic_data_raster_",model),file=paste0("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Genomic_offset/RDA/future_climatic_data_raster_",model,".Rdata"),F=T)
}

6.4 LFMM analysis

For the LFMM analysis, we created a data frame containing climatic values at the individual level. Individuals from the same population were assigned identical climatic values, as LFMM operates at the individual level.

#Load meta_data at individual level

climatic_data_indivdual_level_new_selection <- merge(meta_data_vcf[,-2],past_climatic_data_order[,c(1:4,6,10,12,15)],"Population")

climatic_data_indivdual_level_scaled_new_selection <- climatic_data_indivdual_level_new_selection[,-c(1:3)] %>% 
  scale()

climatic_data_indivdual_level_scaled_new_selection_df <- data.frame(climatic_data_indivdual_level_new_selection[,c(1:3)],climatic_data_indivdual_level_scaled_new_selection)

#Rename columns
colnames(climatic_data_indivdual_level_scaled_new_selection_df) <- c("Population","VCF_ID","Country","Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")

7 Comparison of the climatic envelope of T. baccata: European range vs. sampled population

We compared the climatic envelope, using the six climatic variables retained for the past period, between the 29 populations and the occurrence points of T. baccata across Europe.

7.1 Shapefile

7.1.1 Climatic data presence data

Load shapefile presence data

presence_shp <- st_read("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/shapefiles/presence_point_europe/euforgen.shp")

## Reading layer `euforgen' from data source 
##   `C:\Users\tfrancisco\Documents\Thesis\Data\Species\Taxus_baccata\shapefiles\presence_point_europe\euforgen.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1016 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -28.3903 ymin: 35.75478 xmax: 27.84555 ymax: 62.78717
## CRS:           NA

We extracted coordinates from the shapefile (SHP) to enable retrieval of climatic data from the corresponding raster layers.

df_coord_data <- data.frame(Longitude=presence_shp$LONG_WGS84,Latitude=presence_shp$LAT_WGS84)

Extract climatic data

#First we need to extract the climatic data
df_coord_data$Longitude <- as.numeric(df_coord_data$Longitude);df_coord_data$Latitude <- as.numeric(df_coord_data$Latitude)

#Raster
ras.bio <- stack(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/", pattern = ".tif$", full.names = T)) 
#Names of the raster (because they are not in the classic order Bio 1, 2 etc they are order by number so bio 1, 10, 11 etc)
names(ras.bio) <- unlist(strsplit(unlist(lapply(strsplit(list.files("C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/", pattern = ".tif$", full.names = T), split = "C:/Users/tfrancisco/Documents/Thesis/Data/Climatic/1901-1950_raster/"), function(x) x[2])), split = ".tif"))

#We extracted the climatic values for each populations from the raster based on their coordinates
clim_shp <- data.frame(raster::extract(ras.bio, df_coord_data))#Important to add raster:: because this function is also in tidyr and will not do the same things
colnames(clim_shp)<- c(names(ras.bio))

final_df_presence_shp <- clim_shp[-5,c(2,11,13,17,4,7)]

final_df_presence_shp$group <- "range-wide"

7.1.2 Combine datasets and scale climatic variables for comparability across periods and populations

Past_clim_data_Europe_pop

##                     Population     Country bio1 bio2  bio4 bio9  bio12 bio15
## 1                       Ancona       Italy 10.0  9.4 676.9  2.6 1032.0  16.7
## 2                     Blindtal Switzerland  3.7  9.0 756.9 -5.7 1183.0  11.7
## 3       Bohinj_Obrne_Mokri_Log    Slovenia  6.6  8.5 741.2 -1.4 2147.3  25.1
## 4               Brandvik_Stord      Norway  6.2  4.5 545.2  8.1 2101.2  28.6
## 5            Bujaruelo-ADAPCON       Spain  4.8  9.3 614.0 12.6 1475.3  22.9
## 6          Bujaruelo-GENTREE-1       Spain  6.4  9.4 613.4 14.2 1447.4  23.8
## 7                        Cardo       Spain 12.5  9.6 596.3 20.0  599.1  38.1
## 8             Castle_Eden_Dene          UK  8.5  5.3 416.1  5.0  685.1  17.9
## 9  Fonte_tassete_Prati_di_Tivo       Italy  5.5  9.5 652.0 13.4 1182.2  21.3
## 10               Foresta_Umbra       Italy 11.5  5.1 602.0 19.0  655.0  32.0
## 11                Gorenja_zaga    Slovenia  9.5  8.4 754.8  1.2 1469.6  22.2
## 12                    Harmanec    Slovakia  4.7  7.9 741.4 -3.8  999.7  16.9
## 13       Hornli-Hulftegg-Roten Switzerland  4.8  8.6 641.5 -3.1 1964.2  27.4
## 14                        Jura Switzerland  7.7  8.7 649.5  1.0 1448.2  16.4
## 15                 Mt_Cholomon      Greece 11.2  8.9 710.7 19.6  659.4  28.1
## 16                  Mt_Olympus      Greece 10.2  6.4 673.8 18.2 1208.0  31.1
## 17                 Mt_Vourinos      Greece  7.9 11.0 726.4 16.6  714.7  37.9
## 18       Paterzeller_Eibenwald     Germany  6.9  8.5 696.5 -0.6 1132.2  41.1
## 19                   Rascafria       Spain  5.9  9.2 647.2 14.3 1026.3  39.7
## 20     Roudsea_Wood_and_Mosses          UK  9.0  3.9 402.0 10.1 1307.6  23.8
## 21                Sainte_Baume      France 10.4  7.9 546.8 17.3  974.3  44.7
## 22             Saro_Vasterskog      Sweden  7.6  5.2 659.1  2.2  716.0  27.0
## 23           Serra_di_Fiumorbu      France 10.4  8.7 606.5 18.2 1058.8  49.6
## 24                       Sueve       Spain  8.7  5.9 436.8 14.4 1778.0  32.3
## 25             Unska_kolisevka    Slovenia  8.0  8.3 714.0  0.2 1838.3  21.7
## 26                  Valditacca       Italy  6.8  9.0 628.5 14.7 1423.6  37.4
## 27                    Visegrad      Bosnia  8.1 10.0 753.7 -0.2  987.8  20.1
## 28                      Wallis Switzerland  3.7 10.4 782.0 -0.9 1107.3  16.9
## 29                   Yewbarrow          UK  8.4  4.2 405.7  9.5 1507.8  24.5

#Past_clim_data_Europe_pop$group <- c("I","CH","SLO","N","E","E","E","UK","I","I","SLO","SK","CH","CH","GR","GR","GR","D","E","UK","F","S","F","E","SLO","I","BIH","CH","UK")
Past_clim_data_Europe_pop$group <- Past_clim_data_Europe_pop$Country

df_combined_clim <- rbind(final_df_presence_shp,Past_clim_data_Europe_pop[,-c(1,2)])

df_combined_clim_scaled <- df_combined_clim %>% dplyr::mutate(across(where(is.numeric), ~scale(.)))

7.1.3 PCA

env_PCA <- PCA(df_combined_clim_scaled[,-c(7)], scale.unit = T,
        ncp = ncol(df_combined_clim_scaled[,-c(7)]),graph=F)

eigvalues<-round(env_PCA$eig,2);eigvalues

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1       1.90                  31.68                             31.68
## comp 2       1.62                  27.07                             58.75
## comp 3       1.12                  18.71                             77.46
## comp 4       0.75                  12.48                             89.94
## comp 5       0.37                   6.20                             96.14
## comp 6       0.23                   3.86                            100.00

barplot(eigvalues[,2], xlab="Axis",ylab="Explained variance", main= "Scatterplot")

Add name of the 29 pop

meta_data_pop$pop_abbrev <- c("VIS","BAU","SDF","PAT","CHO","VOU","OLY","ANC","PDT","UMB","VAL","BRA","HAR","GOR","BOM","UNS","BUJA","BUJG","CAR","RAS","SUE","SAV","BLI","HHR","JUR","WAL","CED","RWM","YEW")

PC_scores<-data.frame(env_PCA$ind$coord[,c(1,2)],Group=df_combined_clim_scaled$group)
explained_var <- round(env_PCA$eig[,2],1)

group_palette <- c("range-wide"="grey","Bosnia"="orangered3", "France"="gold2","Germany"= "darkorchid3", "Greece"="navyblue", "Italy"="turquoise2", "Norway"="green3", "Slovakia"="blue", "Slovenia"="red", "Spain"="black", "Sweden"="pink", "Switzerland"="darkorange", "UK"="darkgreen")

data_name <- cbind(meta_data_pop[,c(1,2,7)],PC_scores[c(1016:1044),])

loadings <- as.data.frame(env_PCA$var$coord[, c(1, 2)]) #Loadings for PC1 and PC2
colnames(loadings) <- c("Dim.1", "Dim.2") #Rename columns
loadings$Variable <- rownames(loadings)  #Add variable names

PCA_plot <- ggplot(PC_scores, aes(x = Dim.1, y = Dim.2, col = Group)) +
  geom_point(size = 2.5, alpha = 0.5) + 
  scale_color_manual(name = "Group", values = group_palette) + 
  xlab(paste0("PC 1 (", explained_var[1], "%)")) + 
  ylab(paste0("PC 2 (", explained_var[2], "%)")) +
  ggtitle("Climatic PCA") +
  geom_segment(data = loadings, aes(x = 0, y = 0, xend = Dim.1*5, yend = Dim.2*5),
               arrow = arrow(length = unit(0.2, "cm")), color = "black", size = 0.5) + 
  geom_text(data = loadings, aes(x = Dim.1, y = Dim.2, label = Variable), 
            color = "black", hjust = 0.5, vjust = -0.5, size = 3.5) +
  guides(color = guide_legend(title = "Group")) +
  theme_bw(base_size = 11) +
  theme(legend.position = "right", 
        panel.grid = element_blank(), 
        strip.text = element_text(size = 11), 
        plot.title = element_text(hjust = 0.5, color = "black", face = "italic"))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(PCA_plot)

pdf("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/Comparison_climate/PCA_presencepointeurope_pop_color.pdf");print(PCA_plot);dev.off()

## png 
##   2

PCA_plot <- ggplot(PC_scores, aes(x = Dim.1, y = Dim.2, col = Group)) +
  geom_point(size = 2, alpha = 0.5) + 
  scale_color_manual(name = "Group", values = group_palette) + 
  xlab(paste0("PC 1 (", explained_var[1], "%)")) + 
  ylab(paste0("PC 2 (", explained_var[2], "%)")) +
  ggtitle("Climatic PCA") +
  geom_segment(data = loadings, aes(x = 0, y = 0, xend = Dim.1*5, yend = Dim.2*5),
               arrow = arrow(length = unit(0.2, "cm")), color = "darkgreen", size = 0.5) + 
  geom_text(data = loadings, aes(x = Dim.1, y = Dim.2, label = Variable), 
            color = "darkgreen", hjust = 0.5, vjust = -0.5, size = 3.5) +
  geom_text(data=data_name, aes(x = Dim.1, y = Dim.2-0.2, label= pop_abbrev), color = "black",size = 3) +
  guides(color = guide_legend(title = "Group")) +
  theme_bw(base_size = 11) +
  theme(legend.position = "right", 
        panel.grid = element_blank(), 
        strip.text = element_text(size = 11), 
        plot.title = element_text(hjust = 0.5, color = "black", face = "italic"))

print(PCA_plot)

pdf("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/climate/Comparison_climate/PCA_presencepointeurope_pop_name.pdf");print(PCA_plot);dev.off()

## png 
##   2

7.2 CG+SNP+SSR populations

One limitation of the EUFORGEN map is that the presence data include potential planted populations, particularly in central France, as well as populations outside the natural distribution of T. baccata, such as in northern UK. Additionally, the distribution shown in the EUFORGEN map appears continuous, whereas the natural distribution of Taxus baccata is highly fragmented. To obtain more precise presence data, we used the coordinates of the 29 SNP-sampled populations, the 26 clonal garden (CG) populations, and the 242 populations sampled for SSR, and compared these with the 29 SNP populations.

7.2.1 Gather the data

#Spain data
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/Spain_pop/Past_clim_data_Spain_pop.Rdata")

#UE data
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/Past_clim_data_Europe_pop.Rdata")

df_2dataset <- rbind(Past_clim_data_Europe_pop[,c(1,3:8)],Past_clim_data_Spain_pop[,c(1,3:8)])

7.2.2 Extract climatic data for SSR pop

SRR_pop_data <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Migration/Meta_data_pop_taxus_migration.csv",h=T,sep=";",dec=",")

df_coord_data_SSR <- data.frame(Longitude=SRR_pop_data$LONG,Latitude=SRR_pop_data$LAT)