R - Exploratory Factor Analysis
Factor analysis is a statistical method used to search for some unobserved variables called factors from observed variables called factors. There are two variations: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).
We will review the preliminary steps to factor analysis including examining the data and the assumptions required for factor analysis and how to determine the number of factors to retain.
In R we need to use the packages psych
link, corrplot
linkand rms
link . For plotting we use ggplot2
.
library(psych)
library(corrplot)
library(rms) # used for vif
library(ggplot2)
1. Describing Data
Fetching data from the server
We first need to fetch sample data from the server. This is a raw data set, so each row row represents a person’s survery.
url <- "https://raw.githubusercontent.com/housecricket/data/main/efa/sample1.csv"
data_survey <- read.csv(url, sep = ",")
Describing the data
We also look at the dataset before we run any analysis.
describe(data_survey)
We use the dim function to retrieve the dimension of the dataset.
dim(data_survey)
Cleaning data
In our data frame, we have an ID variable in the first column. So, we can use a -1 in the column index to remove the first column and save our data to a new object.
data <- data_survey[ , -1]
head(data)
Correlation matrix
We also should take a look at the correlations among our variables to determine if factor analysis is appropriate.
datamatrix <- cor(data[,c(-13)])
corrplot(datamatrix, method="number")
2. Factorability of the Data
X <- data[,-c(13)]
Y <- data[,13]
KMO
The Kaiser-Meyer-Olkin (KMO) used to measure sampling adequacy is a better measure of factorability.
KMO(r=cor(X))
According to Kaiser’s 1974 guidelines (Kaiser, H.F. An index of factorial simplicity. Psychometrika 39, 31–36 (1974)), a suggested cutoff for determining the factorability of the sample data is KMO ≥ 60. The total KMO is 0.83, indicating that, based on this test, we can probably conduct a factor analysis.
Bartlett’s Test of Sphericity
cortest.bartlett(X)
Small values (8.84e-290 < 0.05) of the significance level indicate that a factor analysis may be useful with our data.
det(cor(X))
We have a positive determinant, which means the factor analysis will probably run.
3. Number of Factors to Extract
Scree Plot
fafitfree <- fa(data, nfactors = ncol(X), rotate = "none")
n_factors <- length(fafitfree$e.values)
scree <- data.frame(
Factor_n = as.factor(1:n_factors),
Eigenvalue = fafitfree$e.values)
ggplot(scree, aes(x = Factor_n, y = Eigenvalue, group = 1)) +
geom_point() + geom_line() +
xlab("Number of factors") +
ylab("Initial eigenvalue") +
labs( title = "Scree Plot",
subtitle = "(Based on the unreduced correlation matrix)")+
theme_minimal()
Parallel Analysis
We can use the parallel() function from the nFactors package (Raiche & Magis, 2020) to perform a parallel analysis.
parallel <- fa.parallel(X)
Parallel analysis suggests that the number of factors = 4 and the number of components = 3
4. Conducting the Factor Analysis
Factor analysis using fa method
fa.none <- fa(r=X,
nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm="pa", # type of factor analysis ("pa" is principal axis factoring)
max.iter=100, # (default is 50)
rotate="varimax") # none rotation
print(fa.none)
Factor analysis using the factanal method
factanal.none <- factanal(X, factors=4,
scores = c("regression"),
rotation = "varimax")
print(factanal.none)
Graph Factor Loading Matrices
fa.diagram(fa.none)
5. Regression analysis
Scores for all the rows
head(fa.var$scores)
Labeling the data
regdata <- cbind(data["QD"], fa.var$scores)
#Labeling the data
names(regdata) <- c("QD", "F1", "F2", "F3", "F4")
head(regdata)
Splitting the data to train and test set
#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]
Regression Model using train data
model.fa.score = lm(Satisfaction~., train)
summary(model.fa.score)
Our model equation can be written as: Y = 3.55309 + 0.72628 x F1 + 0.29138 x F2 + 0.06935 x F3 + 0.62753 x F4
Check Variance Inflation Factor
vif(model.fa.score)
Check prediction of the model in the test dataset
#Model Performance metrics:
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test
test$QD_Predicted <- pred_test
head(test[c("QD","QD_Predicted")], 10)