This script filters the genetic data of Taxus baccata, obtained from 501 trees from 29 populations covering nearly the entire range of T. baccata in Europe. t then generates genetic data files in the appropriate formats for subsequent analyses. Two distinct filtering strategies were applied, depending on the analysis:
For population structure analyses: no correction for minor allele count (MAC) and no imputation of missing data.
For genotype–environment association (GEA) and other analyses: correction for MAC (set to 20, corresponding to twice the number of individuals in the smallest population before filtering, which equates to a minor allele frequency of 2.11%) and imputation of missing data using the most common genotype within each main gene pool.
Single nucleotide polymorphism (SNP) calling for both datasets was conducted following GATK best practices (version 3, with –filterExpression ‘MQ < 40 || MQrank > 12.5 || DP < 7.0 || Q < 50.0 || QD < 1.5 || FS > 60.0 ||ReadPosRanksum > -8’; Auwera and O’Connor 2020). Filtering was performed with GATK on Linux (version 4.2.3.0).
he raw VCF file obtained after GATK filters contained 11,374 SNPs, 490 individuals, and 8.76% missing data.
vcf <- read.vcfR("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic/filtered_T_11374SNP_490IND_ADAPCON_GENTREE.vcf", verbose = FALSE,convertNA=T)
meta_data_vcf=read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Samples/samples_taxus_baccata_adapcon_gentree.csv",h=T,sep=";",dec=",")
meta_data_pop <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Populations/taxus_sample_29pop.csv",h=T,sep=";",dec=",")
For population structure analyses, it is important to minimise the proportion of missing data. Therefore, we applied more stringent filters for missing data. However, because minor alleles can provide valuable information on genetic structure, we did not apply filters on minor allele count (MAC). Filtering was performed using the vcfR package.
min_dp= 7
max_dp=250
Allele_balance_min= .2
Allele_balance_max= .8
Na_snp=0.85 #% of Na per SNP, 0.85 = only SNPs with less than 15% missing data were retained.
Na_indiv=0.15#% of Na per individual, 0.15 = only individuals with less than 15% missing data were retained.
vcf_filter_structure_V2 <- vcf %>%
hard_filter(depth=min_dp ) %>%
max_depth(maxdepth =max_dp )%>%
filter_allele_balance(min.ratio=Allele_balance_min,max.ratio=Allele_balance_max)%>%
missing_by_snp(cutoff=Na_snp)%>%
missing_by_sample( cutoff = Na_indiv )
#vcfR::write.vcf(vcf_filter_structure, file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/script_filtering(not_running_it_each_time)/bayescan.vcf.gz")
After filtering, the dataset contains:
vcf_filter_structure
## ***** Object of Class vcfR *****
## 452 samples
## 1025 CHROMs
## 8,252 variants
## Object size: 102.3 Mb
## 4.374 percent missing data
## ***** ***** *****
We performed two types of population structure analyses: STRUCTURE and principal component analysis (PCA). Because these analyses have different data input requirements, the genetic data were formatted accordingly
To use STRUCTURE, the VCF data stored in the vcfR object were transformed into the required format: the first row contains SNP names, and each subsequent row represents an individual, with each SNP split into two columns to represent the genotype (e.g., a genotype of 1/1 is coded as 1 in both columns). Because STRUCTURE requires a very specific input format, several steps in R and Ubuntu were performed to achieve this structure.
#change VCFR to dataframe
vcf_data <- vcf_filter_structure %>%
extract.gt(element="GT", IDtoRowNames = T, as.numeric=F) %>% t %>% data.frame
Genotypic information for each SNP was split into two columns using the split_columns function.
#the function
split_columns <- function(data) {
new_data <- data.frame(matrix(ncol = 0, nrow = nrow(data))) #Create an empty dataframe for the output
for (col in names(data)) { #Apply the function to each column of the dataframe
split_parts <- strsplit(as.character(data[[col]]), "/") #Split the columns based on the '/' character
max_parts <- max(sapply(split_parts, length)) #= maximum number of split parts per column and store the resulting new columns to assign names
for (i in 1:max_parts) {#Names the columns by the name of the snp + _part 1 or 2
new_column_name <- paste0(col, "_part", i)
new_data[[new_column_name]] <- sapply(split_parts, function(x) ifelse(length(x) >= i, x[i], NA))
}
}
row.names(new_data)<-row.names(data)
return(new_data)
}
data_split <- split_columns(vcf_data)
Final dataset and exporting it to a txt file
Next, we ensured that each SNP had a single name for the STRUCTURE input by extracting SNP names from the VCF data and modifying them as needed. Missing data (NA) were replaced with -9, and any quotation marks (“”) were removed from the text file using a text editor (e.g., Notepad).
Finally, because the file was created in Windows but intended for use with Linux software, we converted it to Linux text format using Ubuntu. This was done with the command dos2unix nameofdata.txt.
Performing PCA requires genotype information, which can be provided either at the individual level or aggregated at the population level.
vcf_data_no_mac_c <- vcf_filter_structure %>%
extract.gt(element="GT", IDtoRowNames = T, as.numeric=F) %>% t %>% data.frame
We also converted the dataset from nucleotide information to genotype data. This raw genotype format was then used to calculate the Omega matrix in BayPass.
#The genotype data were originally coded as 0/0, 0/1, and 1/1, and were transformed into allele counts for downstream analyses.
Data_geno_no_maf_c_8252SNP_452IND <- vcf_data_no_mac_c %>%
mutate_all(funs(str_replace(., "0/0", "0"))) %>%
mutate_all(funs(str_replace(., "0/1", "1"))) %>%
mutate_all(funs(str_replace(., "1/1", "2")))
#Save for BayPass analysis
save(Data_geno_no_maf_c_8252SNP_452IND,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/GEA/Data_geno_no_maf_c_8252SNP_452IND.Rdata" )
We added the percentage of missing data (NA) per individual to the file in order to visualise it on the PCA plots.
#Specify SNPs as numeric to calculate the percentage of missing data per individual
Dataset_PCA_8252SNP_452IND <- data.frame(row.names(vcf_data),apply(Data_geno_no_maf_c_8252SNP_452IND, 2, as.numeric))
#Calculation of NA %
na_percentage_indiv <- rowMeans(is.na(Dataset_PCA_8252SNP_452IND[,-c(1)])) * 100
Save the prepared file for PCA analysis
Dataset_PCA_8252SNP_452IND$na_percentage_indiv <- na_percentage_indiv
names(Dataset_PCA_8252SNP_452IND)[names(Dataset_PCA_8252SNP_452IND) == 'row.names.vcf_data.'] <- 'VCF_ID'
#Save the genetic dataset
save(Dataset_PCA_8252SNP_452IND,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/structure/Dataset_PCA_8252SNP_452IND.Rdata")
#The VCF_ID was added to the dataset, although it is normally present in the imputed file.
Gen_matrix_non_imputed_T_adapcon_gentree_8252_452 <- Dataset_PCA_8252SNP_452IND[,-8254]
#Data for allelic frequencies
data_frequencies <-merge(meta_data_vcf,Gen_matrix_non_imputed_T_adapcon_gentree_8252_452,"VCF_ID");row.names(data_frequencies) <- data_frequencies$VCF_ID
#Formatting the genomic data
data_frequencies_num <- data_frequencies[,-c(1:3)] %>% #keeping only the snps
apply(2,as.numeric) /2 #we divided by 2 because of the format of genomic data: 0,1,2 and for allelic frequencies and we want 0,0.5, 1
#The dataset was prepared with all relevant genomic information in the correct format
data_frequencies_num_tot <- data.frame(data_frequencies[,c(1:3)],data_frequencies_num)
#Calculation of allele frequencies
allelic_frequencies <-data_frequencies_num_tot %>% dplyr::select(-c("VCF_ID","Country","Population")) %>% #Non-genomic data were removed
group_by(data_frequencies_num_tot$Population) %>% #To obtain population-level allele frequencies, individuals were grouped by population
summarise_at(vars(everything()),funs(mean),na.rm=T) %>% #mean allele frequency for each SNP per population was calculated
ungroup() %>%
as.data.frame()
Save the dataset of allelic frequencies
#Move population level to row.names
data_allelic_frequencies_non_imp_T_adapcon_gentree_452_8252snps <- allelic_frequencies %>% dplyr::select(-c("data_frequencies_num_tot$Population")); row.names(data_allelic_frequencies_non_imp_T_adapcon_gentree_452_8252snps) <- allelic_frequencies$'data_frequencies_num_tot$Population'
#Save
save(data_allelic_frequencies_non_imp_T_adapcon_gentree_452_8252snps,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/data_allelic_frequencies_non_imp_T_adapcon_gentree_452_8252snps.Rdata")
For genotype–environment association (GEA) and other landscape genomic analyses, we applied less stringent filtering for missing data, as the accuracy of these approaches seems to improve with an increased number of SNPs and individuals (Santos & Gaiotto, 2020). Remaining missing data were imputed using the most common genotype within each main gene pool identified by STRUCTURE clustering (Pritchard et al., 2000) and evaluated with the Evanno method (Evanno et al., 2005), because most GEA analyses do not support missing data. Additionally, to try to reduce the risk of false positives associated with low-frequency alleles, all SNPs with a minor allele count below twice the number of individuals in the smallest population prior to filtering were removed.
min_dp= 7
max_dp=250
Allele_balance_min= .2
Allele_balance_max= .8
mac=20
Na_snp=0.7 #% of Na per SNP, 0.70 = only SNPs with less than 30% missing data were retained.
Na_indiv=0.3#% of Na per individual, 0.30 = only individuals with less than 30% missing data were retained.
vcf_filter_GEA <- vcf %>%
hard_filter(depth= min_dp) %>%
max_depth(maxdepth =max_dp )%>%
filter_allele_balance(min.ratio=Allele_balance_min,max.ratio=Allele_balance_max) %>%
min_mac(min.mac = mac) %>%
missing_by_snp(cutoff=Na_snp) %>%
missing_by_sample( cutoff = Na_indiv)
Load file
After filtering, we have:
vcf_filter_GEA
## ***** Object of Class vcfR *****
## 475 samples
## 1031 CHROMs
## 8,616 variants
## Object size: 118.3 Mb
## 8.523 percent missing data
## ***** ***** *****
We converted the VCF data to the format required for GEA analyses. First, only the genotype information was retained from the vcfR object.
vcf_data_GEA <- vcf_filter_GEA %>%
extract.gt(element="GT", IDtoRowNames = T, as.numeric=F) %>% t %>% data.frame
The dataset was also converted from nucleotide information to genotype format.
#The format is in 0/0, 0/1 and 1/1 so we transform this in allele count
genetic_data_geno <- vcf_data_GEA %>%
mutate_all(funs(str_replace(., "0/0", "0"))) %>%
mutate_all(funs(str_replace(., "0/1", "1"))) %>%
mutate_all(funs(str_replace(., "1/1", "2")))
Missing data were imputed using the most common genotype within each main gene pool.
#First, a main_gene_pool label was added to each individual based on STRUCTURE results, using K = 3 to define three main gene pools.
meta_data_pop$main_gene_pool <- c("Eastern_pool","Western_pool","Western_pool","Central_pool","Eastern_pool","Eastern_pool","Eastern_pool","Western_pool","Western_pool","Western_pool","Central_pool","Western_pool","Eastern_pool","Eastern_pool","Eastern_pool","Eastern_pool","Western_pool","Western_pool","Western_pool","Western_pool","Western_pool","Central_pool","Central_pool","Central_pool","Central_pool","Central_pool","Western_pool","Western_pool","Western_pool")
#Merge info
meta_data_tot <- merge(meta_data_vcf,meta_data_pop,"Population")
#Imputation steps
##Keep ID info
gen_grouped <- genetic_data_geno %>% tibble::rownames_to_column(var = "VCF_ID")#We put the ID of samples in a column
list_indiv_427 <- list(gen_grouped$VCF_ID)
##A metadata file was created containing only the individuals retained after the filtering steps.
meta_data_vcf_imputation <- meta_data_tot[meta_data_tot$VCF_ID %in% gen_grouped$VCF_ID,]
##We need to specify that SNPs are numeric
gen_grouped_num <- data.frame(gen_grouped[,c("VCF_ID")],apply(gen_grouped[,c(2:8617)], 2, as.numeric))
##Name of VCF_ID to merge
names(gen_grouped_num)[names(gen_grouped_num) == 'gen_grouped...c..VCF_ID...'] <- 'VCF_ID'
##Total dataset
gen_grouped_num_tot <- merge(meta_data_vcf_imputation[,-c(4:8)],gen_grouped_num,"VCF_ID")
##Perform the imputation by transforming the NAs into the most common genotype per gene pool:
gen_imp <- gen_grouped_num_tot %>%#Gene pool
group_by(main_gene_pool) %>% #Separate the table to organize it per main gene pool and apply the function below at the gene pool level
apply(2, function(x) replace(x, is.na(x), as.numeric(names(which.max(table(x)))))) %>% #Calculate the most common genotype and fill the Na with that
as.data.frame() %>%
ungroup()
##Convert ID column back to row names
rownames(gen_imp) <- gen_imp$VCF_ID
gen_matrix <- gen_imp[,-c(1:4)] # Remove ID,country and main gene pool columns from data frame
##We have an issues with space added during imputation, we removed them:
library(stringr)
Gen_matrix_imp_T_Adapcon_Gentree_475_8616 <- gen_matrix %>%
mutate(across(where(is.character), str_trim))
##Check the proportion of missing data before and after imputation
prop.table(table(is.na(gen_grouped_num_tot)))
##
## FALSE TRUE
## 0.91481084 0.08518916
prop.table(table(is.na(Gen_matrix_imp_T_Adapcon_Gentree_475_8616)))
##
## FALSE
## 1
Two types of genomic datasets were saved for downstream
analyses:
- individual-level genomic data
- population-level genomic data
Both datasets can serve as input for GEA or other genomic analyses, depending on our goalsand study design. In this study, we used the population-level dataset.
This dataset represents the output after the imputation steps.
#Save the dataframes to load it in further scripts to avoid going through this script every time
save(Gen_matrix_imp_T_Adapcon_Gentree_475_8616,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/Gen_matrix_imp_T_Adapcon_Gentree_475_8616.Rdata")
#We also saved a version with the nucleotide format : 0/0, 0/1 and 1/1 for baypass
Gen_matrix_imp_T_Adapcon_Gentree_475_8616_BAYPASS <- Gen_matrix_imp_T_Adapcon_Gentree_475_8616 %>%
mutate_all(funs(str_replace(.,"0","0/0"))) %>%
mutate_all(funs(str_replace(., "1","0/1"))) %>%
mutate_all(funs(str_replace(., "2", "1/1")))
save(Gen_matrix_imp_T_Adapcon_Gentree_475_8616_BAYPASS,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/GEA/Gen_matrix_imp_T_Adapcon_Gentree_475_8616_BAYPASS.Rdata")
To perform analyses at the population level, allele frequencies were calculated for each SNP within each population.
#Add VCF_ID but normally it should be in the imputated file
Gen_matrix_imp_T_Adapcon_Gentree_475_8616$VCF_ID <- row.names(Gen_matrix_imp_T_Adapcon_Gentree_475_8616)
#Metadata
meta_data_vcf_475 <- meta_data_vcf[meta_data_vcf$VCF_ID %in% row.names(Gen_matrix_imp_T_Adapcon_Gentree_475_8616),]
#Data for allelic frequencies
data_frequencies <-merge(meta_data_vcf_475,Gen_matrix_imp_T_Adapcon_Gentree_475_8616,"VCF_ID");row.names(data_frequencies) <- data_frequencies$VCF_ID
#Formatting the genomic data
data_frequencies_num <- data_frequencies[,-c(1:3)] %>% #Keeping only the snps
apply(2,as.numeric) /2 #we divided by 2 because of the format of genomic data: 0,1,2 and for allelic frequencies and we want 0,0.5, 1
#Dataset with all information and genomic data in the right format
data_frequencies_num_tot <- data.frame(data_frequencies[,c(1:3)],data_frequencies_num)
#Calculation of allelic frequencies
allelic_frequencies <-data_frequencies_num_tot %>% dplyr::select(-c("VCF_ID","Country","Population")) %>% #Remove non genomic data from the dataset
group_by(data_frequencies_num_tot$Population) %>% #We wanted the allelic frequencies at the population level so we grouped
summarise_at(vars(everything()),funs(mean),na.rm=T) %>% #Calculate the mean for each snp per pop
ungroup() %>%
as.data.frame()
Save the allelic frequencies dataset
#Move population level to row.names
data_allelic_frequencies_29pop_adapcon_gentree_475_8616 <- allelic_frequencies %>% dplyr::select(-c("data_frequencies_num_tot$Population")); row.names(data_allelic_frequencies_29pop_adapcon_gentree_475_8616) <- allelic_frequencies$'data_frequencies_num_tot$Population'
#Save
save(data_allelic_frequencies_29pop_adapcon_gentree_475_8616,file="C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/data_allelic_frequencies_29pop_adapcon_gentree_475_8616.Rdata")
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.2 (2023-10-31 ucrt)
## os Windows 11 x64 (build 26100)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate French_France.utf8
## ctype French_France.utf8
## tz Europe/Paris
## date 2025-09-22
## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## ape 5.7-1 2023-03-13 [1] CRAN (R 4.3.2)
## bslib 0.6.1 2023-11-28 [1] CRAN (R 4.3.2)
## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.1)
## cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
## cluster 2.1.6 2023-12-01 [2] CRAN (R 4.3.2)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.3)
## digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.1)
## evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2)
## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
## fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.3.2)
## ggplot2 * 3.5.2 2025-04-09 [1] CRAN (R 4.3.2)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.3.3)
## htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.2)
## httpuv 1.6.13 2023-12-06 [1] CRAN (R 4.3.2)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.3.1)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2)
## knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.3.2)
## lattice 0.22-5 2023-10-24 [2] CRAN (R 4.3.2)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.2)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
## MASS 7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.2)
## Matrix 1.6-5 2024-01-11 [1] CRAN (R 4.3.2)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.1)
## memuse 4.2-3 2023-01-24 [1] CRAN (R 4.3.1)
## mgcv 1.9-1 2023-12-21 [2] CRAN (R 4.3.2)
## mime 0.12 2021-09-28 [1] CRAN (R 4.3.1)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.3)
## nlme 3.1-164 2023-11-27 [2] CRAN (R 4.3.2)
## permute 0.9-7 2022-01-27 [1] CRAN (R 4.3.2)
## pillar 1.10.2 2025-04-05 [1] CRAN (R 4.3.3)
## pinfsc50 1.3.0 2023-12-05 [1] CRAN (R 4.3.2)
## pkgbuild 1.4.3 2023-12-10 [1] CRAN (R 4.3.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.2)
## pkgload 1.3.4 2024-01-16 [1] CRAN (R 4.3.2)
## plyr 1.8.9 2023-10-02 [1] CRAN (R 4.3.2)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.2)
## promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.2)
## purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.3.3)
## RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.3.1)
## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.3.3)
## reshape2 * 1.4.4 2020-04-09 [1] CRAN (R 4.3.2)
## rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2)
## rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
## rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.2)
## sass 0.4.8 2023-12-06 [1] CRAN (R 4.3.2)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
## shiny 1.8.0 2023-11-17 [1] CRAN (R 4.3.2)
## SNPfiltR * 1.0.1 2023-03-17 [1] CRAN (R 4.3.2)
## stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.2)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.3.2)
## vcfR * 1.15.0 2023-12-08 [1] CRAN (R 4.3.2)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
## vegan 2.6-4 2022-10-11 [1] CRAN (R 4.3.2)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.2)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.3.3)
## writexl * 1.5.0 2024-02-09 [1] CRAN (R 4.3.2)
## xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.2)
## yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
##
## [1] C:/Users/tfrancisco/AppData/Local/R/win-library/4.3
## [2] C:/Users/tfrancisco/AppData/Local/Programs/R/R-4.3.2/library
##
## ──────────────────────────────────────────────────────────────────────────────
what is below is a draft, not useful in this study, could be use for imputation at the gene pool level if we imputed some gene pool per multiple gene pools