rm(list = ls())
knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    cache = FALSE
)

1 Introduction

This script performs candidate detection using the LFMM method (Frichot et al. 2013). The workflow is based on the LFMM tutorial by Frichot and François and the tutorial by Brenna R. Forester.

LFMM can be applied in either a univariate or multivariate framework, allowing SNP-environment associations to be tested individually or jointly. The method can account for population genetic structure or be run without correction. It is a linear approach, as it assumes a linear relationship between loci and climatic variables.

LFMMs are regression models that combine fixed effects (environmental variables) with latent effects (hidden factors capturing background population structure) (Caye et al. 2019). Latent factors are estimated using a factorisation approach similar to PCA, which explains why latent factors often align with principal component axes (Frichot et al. 2013).

Model overview: environmental variables (primary predictors) are tested against genetic variation while correcting for population structure. Populations that are closer in latent space are expected to be more genetically similar, whereas those that are further apart are less similar, suggesting additional drivers of genetic variation. The effect of population structure is accounted for simultaneously during coefficient estimation, thereby reducing confounding between structure and environment. This approach also reduces the false negative rate.

There are two types of LFMM implementations:
lfmm(), a Bayesian method based on a Markov Chain Monte Carlo (MCMC) algorithm.
lfmm2(), a frequentist method using least-squares estimation, recommended for large datasets (1,000-10,000 SNPs) because it is faster while producing results comparable to the Bayesian approach.

2 Data

Genomic data must be available as individual-level allele counts (for the old version).

#climatic data at individual level
load("Data/Species/Taxus_baccata/Climatic_data/new_selection/climatic_data_indivdual_level_scaled_new_selection_df.Rdata")
climatic_data <- climatic_data_indivdual_level_scaled_new_selection_df

load("Data/Species/Taxus_baccata/genetic_new/Gen_matrix_imp_T_Adapcon_Gentree_475_8616.Rdata")
#genomic data in numeric
genomic_data <- Gen_matrix_imp_T_Adapcon_Gentree_475_8616 %>% mutate_all( function(x) as.numeric(as.character(x)))

#meta_data
meta_data_pop <- read.csv("Data/Species/Taxus_baccata/Populations/taxus_sample_29pop.csv",h=T,sep=";",dec=",")
#alphabetic order
meta_data_pop_order <- meta_data_pop[order(meta_data_pop$Population),]

meta_data_vcf <- read.csv("Data/Species/Taxus_baccata/samples/samples_taxus_baccata_adapcon_gentree.csv",h=T,sep=";",dec=",")
genomic_data$VCF_ID <- row.names(genomic_data)
df <- merge(meta_data_vcf,genomic_data,"VCF_ID")

We need to fit the climate data to the number of individuals in the genomic data.

climatic_data_475 <- climatic_data[climatic_data$VCF_ID %in% row.names(Gen_matrix_imp_T_Adapcon_Gentree_475_8616),]
climate_format <- climatic_data_475[,-c(1:3)]

3 Run LFMM model

3.1 Number of latent factors

Based on the PCA of the genetic data, the first two principal components appear sufficient to capture population structure, as they discriminate the main gene pools identified by STRUCTURE. However, because latent factors do not necessarily behave in the same way as principal components, we performed a latent factor analysis to determine how many latent factors are required to properly separate the three main gene pools.

LFMM2 model

Here we have chosen to use the LFMM2 algorithm because it’s faster with larger data sets.

Run_LFMM2 <- lfmm2(input = genomic_data[,-8617], env = climatic_data_475[,-c(1:3)], K = 2, effect.sizes = T)
# GEA test
score_LFMM <- data.frame(Run_LFMM2@U)

score_Pca <- data.frame(score_LFMM, row.names (genomic_data))
      colnames(score_Pca) <- c(paste0("PC1"), paste0("PC2"), "VCF_ID")
      #add pop info on the score_pca
  score_Pca_meta_data <- merge(climatic_data_475[,c(1:2)],score_Pca, "VCF_ID")
      
      PCa_df_score <- merge(score_Pca_meta_data, meta_data_pop_order, "Population")
      
latent_factor_LFMM2_candidates_selection <- ggplot() +
  geom_point(data = PCa_df_score, aes(PC1, PC2, color = Country),size=2.6) +
  scale_colour_manual(name = "Countries",
                      values = c("orangered3", "gold2", "darkorchid3", "navyblue", "turquoise2", "green3", "blue", "red", "black", "gray", "orange", "darkgreen")) +
  xlab("Latent factor 1") + 
  ylab("Latent factor 2") +
  guides(color = guide_legend(title="Country",override.aes = list(size = 3.5)))+ 
  theme_bw(base_size = 12) +
  theme(
    panel.grid = element_blank(), 
    plot.background = element_blank(), 
    panel.background = element_blank(), 
    strip.text = element_text(size = 12),
    legend.text = element_text(size = 13),
    legend.title = element_text(size = 14),
    axis.title = element_text(size = 13)
  )
print(latent_factor_LFMM2_candidates_selection)

3.2 Pvalues calculation and pvalues distribution

#non correcting for GIF
pv_non_corrected <- lfmm2.test(object = Run_LFMM2,
input = genomic_data[,-8617],
env = climatic_data_475[,-c(1:3)],
full = T,
genomic.control = F)

#correcting for GIF
pv_corrected <- lfmm2.test(object = Run_LFMM2,
input = genomic_data[,-8617],
env = climatic_data_475[,-c(1:3)],
full = T,
genomic.control = T)

The next step is to visualise the p-value distribution with and without correction by the genomic inflation factor (GIF). The GIF expresses the deviation of the observed test statistic distribution compared to the expected distribution (Van den Berg et al. 2019).

High GIF -> insufficient correction for population stratification, leading to an inflated false positive rate.
Low GIF -> overcorrection, resulting in loss of signal and an increased false negative rate.
GIF = 1 -> appropriate correction.

To assess the most appropriate GIF for our data, we can examine the shape of the p-value distribution. Ideally, this distribution should show a strong peak near 0, followed by a smooth, uniform distribution from 0.1 to 1. We will compare the distributions obtained with and without GIF correction.

#non corrected for GIF
Histogram_of_non_calibrated_Pvalues_LFMM<- hist(pv_non_corrected$pvalues,
     main= "Histogram of non-calibrated P-values",
     xlab= "P-values")

#correcting for GIF
Histogram_of_calibrated_Pvalues_LFMM<-hist(pv_corrected$pvalues,
     main= "Histogram of calibrated P-values",
     xlab= "P-values")

#gif values
pv_corrected$gif

## [1] 2.423337

4 Candidate selection

4.1 Thresholds

We used the false discovery rate (FDR) threshold of 5% and a more relaxed threshold of 10% depending on the species (see Supplementary Methods)

df_pvalues_calibrated <- data.frame(SNP=colnames(genomic_data[,-8617]),pvalues=pv_corrected$pvalues)

#FDR correction
candidates_FDR <- data.frame(snp_names=colnames(genomic_data[,-8617]) ,qvalues=qvalue(pv_corrected$pvalues)$qvalues)

#threshold 0.05
thres_FDR <- 0.05

candidates_T_adapcon_gentree_LFMM_5perc <- data.frame(SNP=candidates_FDR$snp_names[which(candidates_FDR$qvalues<thres_FDR)],qvalues = candidates_FDR$qvalues[which(candidates_FDR$qvalues<thres_FDR)])

length(which(candidates_FDR$qvalues < thres_FDR))

## [1] 14

#FDR 10% 
thres_FDR <- 0.1
candidates_T_adapcon_gentree_LFMM_10perc <- data.frame(SNP=candidates_FDR$snp_names[which(candidates_FDR$qvalues<thres_FDR)],qvalues = candidates_FDR$qvalues[which(candidates_FDR$qvalues<thres_FDR)])

length(which(candidates_FDR$qvalues < thres_FDR))

## [1] 24

4.2 Plot

We plotted the candidates in a Manhattan plot with a false discovery rate (FDR) threshold of 5%.

#selection of the candidates from FDR 5%
df_pvalues_calibrated$type <- "Other SNPs"
df_pvalues_calibrated$type[df_pvalues_calibrated$SNP%in%candidates_T_adapcon_gentree_LFMM_5perc$SNP] <- "Candidate SNP FDR 5%"
df_pvalues_calibrated$type <- as.factor(df_pvalues_calibrated$type)

#Bonferroni threshold
threshold_bonferroni <- 0.05/nrow(df_pvalues_calibrated)

#plot
Manhattan_plot_LFMM_FDR_5perc <- ggplot(df_pvalues_calibrated) +
  geom_point(aes(x=1:nrow(df_pvalues_calibrated), y= -log10(pvalues), col = type), size=1.4) +
  scale_color_manual(values = c("orange","lightblue")) +
  xlab("Loci") + ylab("-log10(p.values)") +
  geom_hline(yintercept= -log10(threshold_bonferroni), linetype="dashed", color = "red", size=0.6) +
  ggtitle("Manhattan plot LFMM, with FDR 5% threshold") +
  guides(color=guide_legend(title="SNP type")) +
  theme_bw(base_size = 11) +
  theme(legend.position="right",
    panel.grid = element_blank(), 
    plot.background = element_blank(), 
    plot.title = element_text(hjust = 0.5,color = "Black",face="italic"),
    panel.background = element_blank(), 
    strip.text = element_text(size = 12),
    legend.text = element_text(size = 13),
    legend.title = element_text(size = 13),
    axis.title = element_text(size = 13)
)

plot(Manhattan_plot_LFMM_FDR_5perc)

We plotted the candiate SNPs for a 10% threshold FDR

#selection of the candidates from FDR 5 and 10%
df_pvalues_calibrated$type <- "Other SNPs" 
df_pvalues_calibrated$type[df_pvalues_calibrated$SNP%in%candidates_T_adapcon_gentree_LFMM_10perc$SNP] <- "Candidate SNPs FDR 10%"
df_pvalues_calibrated$type <- as.factor(df_pvalues_calibrated$type)

#Bonferroni threshold
threshold_bonferroni <- 0.05/nrow(df_pvalues_calibrated)

#plot
Manhattan_plot_LFMM_FDR_10perc <- ggplot(df_pvalues_calibrated) +
  geom_point(aes(x=1:nrow(df_pvalues_calibrated), y=-log10(pvalues), col = type), size=2.5) +
  scale_color_manual(values = c("orange","lightblue")) +
  xlab("Loci") + ylab("-log10(p.values)") +
  geom_hline(yintercept=-log10(threshold_bonferroni), linetype="dashed", color = "red", size=0.6) +
  ggtitle("Manhattan plot LFMM, with FDR 10% threshold") +
     guides(color = guide_legend(title="SNP type",override.aes = list(size = 3)))+ # Increase legend point size
  theme_bw(base_size = 14) +
 theme(legend.position="right",
    panel.grid = element_blank(), 
    plot.background = element_blank(), 
    plot.title = element_text(hjust = 0.5,color = "Black",face="italic"),
    panel.background = element_blank(), 
    strip.text = element_text(size = 14),
    legend.text = element_text(size = 13),
    legend.title = element_text(size = 14),
    axis.title = element_text(size = 13)
)
plot(Manhattan_plot_LFMM_FDR_10perc)

We also We plotted the candidates in a Manhattan plot with a false discovery rate (FDR) threshold of 5% and 10%.

#selection of the candidates from FDR 5 and 10%
df_pvalues_calibrated$type <- "Other SNPs" 
df_pvalues_calibrated$type[df_pvalues_calibrated$SNP%in%candidates_T_adapcon_gentree_LFMM_10perc$SNP] <- "Candidate SNPs FDR 10%"
df_pvalues_calibrated$type[df_pvalues_calibrated$SNP%in%candidates_T_adapcon_gentree_LFMM_5perc$SNP] <- "Candidate SNPs FDR 5%"
df_pvalues_calibrated$type <- as.factor(df_pvalues_calibrated$type)

#Bonferroni threshold
threshold_bonferroni <- 0.05/nrow(df_pvalues_calibrated)

#plot
Manhattan_plot_LFMM_FDR_5_10perc <- ggplot(df_pvalues_calibrated) +
  geom_point(aes(x=1:nrow(df_pvalues_calibrated), y=-log10(pvalues), col = type), size=1.4) +
  scale_color_manual(values = c("darkgreen","orange","lightblue")) +
  xlab("Loci") + ylab("-log10(p.values)") +
  geom_hline(yintercept=-log10(threshold_bonferroni), linetype="dashed", color = "red", size=0.6) +
  ggtitle("Manhattan plot LFMM, with FDR 5 and 10% threshold") +
  guides(color=guide_legend(title="SNP type")) +
  theme_bw(base_size = 11) +
  theme(legend.position="right",
    panel.grid = element_blank(), 
    plot.background = element_blank(), 
    plot.title = element_text(hjust = 0.5,color = "Black",face="italic"),
    panel.background = element_blank(), 
    strip.text = element_text(size = 12),
    legend.text = element_text(size = 13),
    legend.title = element_text(size = 13),
    axis.title = element_text(size = 13)
)
plot(Manhattan_plot_LFMM_FDR_5_10perc)

4.3 Save candidates

We need to save the identified candidates for downstream analysis

#FDR 5%
write_xlsx(candidates_T_adapcon_gentree_LFMM_5perc,"Results/species/taxus/GEA_new_var/outliers/candidates_T_adapcon_gentree_LFMM_5perc.xlsx")
save(candidates_T_adapcon_gentree_LFMM_5perc, file="Results/species/taxus/GEA_new_var/outliers/candidates_T_adapcon_gentree_LFMM_5perc.Rdata")

#FDR 10%
save(candidates_T_adapcon_gentree_LFMM_10perc, file="Results/species/taxus/GEA_new_var/outliers/candidates_T_adapcon_gentree_LFMM_10perc.Rdata")

5 Corrected genetic matrix for GF correted analyses

To perform outlier detection corrected for population structure using Gradient Forest (GF), we need to use a genomic matrix that has already been corrected for population structure, as GF cannot correct for it. To compute this corrected matrix, we followed Archambeau et al. (2025) and used the LFMM approach to obtain the corrected genotypic matrix.
Below are the explained LFMM2 models of Caye et al. (2019):

B, U and V are the effect sizes and the factor and loading matrices adjusted by the LFMM2 algorithm from the set of current environmental variables included in the matrix X. B is a matrix of dimension p × b where p is the number of genetic markers and b is the number of environmental variables. U is a matrix of dimension n × K where n is the number of individuals (i.e. genotypes) and K is the number of latent factors. V is a matrix of dimension p x K. X is a matrix of dimension n x b. Yfut is a matrix of dimension n x p.

We want a matrix of allele frequencies corrected for population structure:
Ycorrected = Yfut - UVt = XBt

Below we have performed a matrix multiplication of the matrix X (dimension n x b) and the transposition of the matrix B (dimension b x p) to obtain the matrix Ycorrect (dimension n x p) as described in Archambeau et al 2025.

The matrix B is the matrix from the lfmm2 output

# matrix X where we x by the matrix B. 
Genomic_matrix_corrected_from_LFMM_T_adapcon_gentree <- as.matrix(climatic_data_475[,-c(1:3)]) %*% t(Run_LFMM2@B) %>% set_rownames(row.names(genomic_data)) %>% set_colnames(colnames(genomic_data)) %>% as.data.frame()

We save the corrected genotypic matrix for GF analysis

Session info

packages <- c(
  "lfmm", "qvalue", "LEA", "ggplot2",
  "writexl", "DescTools", "magrittr", "dplyr"
)
info <- sessionInfo()
packages_used <- info$otherPkgs[packages]
data.frame(
  Package = names(packages_used),
  Version = sapply(packages_used, function(x) x$Version)
)

##             Package Version
## lfmm           lfmm     1.1
## qvalue       qvalue  2.34.0
## LEA             LEA  3.14.0
## ggplot2     ggplot2   3.5.2
## writexl     writexl   1.5.0
## DescTools DescTools 0.99.54
## magrittr   magrittr   2.0.3
## dplyr         dplyr   1.1.4

LFMM_candidate_detection

Thomas Francisco

2024-04-30