An LSTM network to predict student learning
IJAIED_Excerpt
Tanner Phillips
2/12/2022
This is an excerpt from the code for my study currently in press with the International Journal of Artificial Intelligence in Education.
library(keras)
library(R6)
Part 1: Data Augmentation
The data for this LSTM network was highly limited (about 1000 total observations). When I originally trained on only the original data, the network failed to converge. This section of code creates ten copies of every observation and randomly replaced 10% of the words in the duplicate observations with a reserved character. (In this case, 0). I originally wrote this code some three years ago now, and a little tidyverse would go a long way here, but I wanted to include this mostly to make clear the technique I used.
#####4: Augment Data (Data noising: https://arxiv.org/abs/1703.02573 )#####
dim(train.text)
alpha.p <- 0.1   # proportion of token positions to overwrite with the reserved character (0)
n.aug <- 10      # number of noised copies to create per observation

# count roughly 20% of the non-padding tokens in each observation (minimum of 1)
aug.dim <- 1:nrow(train.text)
for(i in 1:nrow(train.text)){
  aug.dim[i] <- round(length(which(train.text[i,] != 0)) * 0.2)
  if(aug.dim[i] == 0){ aug.dim[i] <- 1 }
}
# containers for the augmented text, assignment-type dummies, and outcome scores
aug.matrix <- matrix(nrow = (n.aug * nrow(train.text)), ncol = max_len)
aug.asgn <- as.data.frame(matrix(nrow = (n.aug * nrow(assignment_type_train)), ncol = 7))
aug.out <- rep(0, times = n.aug * nrow(train.text))

row <- 1
asgn.df <- as.data.frame(assignment_type_train)
# each original observation is copied n.aug times; in every copy, alpha.p of the
# first 200 token positions are overwritten with the reserved character 0
for(i in 1:nrow(aug.matrix)){
  samp <- sample(1:200, 200 * alpha.p, replace = FALSE)
  aug.matrix[i,] <- train.text[row,]
  aug.matrix[i, samp] <- 0
  aug.asgn[i,] <- asgn.df[row,]
  aug.out[i] <- train.scores[row]
  if(i != 1 & i %% n.aug == 0){
    row <- row + 1   # advance to the next original observation after n.aug copies
  }
}
# append the un-noised originals so they remain part of the training set
names(assignment_type_train) <- names(aug.asgn)
aug.matrix <- rbind(aug.matrix, train.text)
aug.asgn <- rbind(aug.asgn, assignment_type_train)
aug.out <- c(aug.out, train.scores)
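For anyone modernizing this step, here is a minimal vectorized sketch of the same noising idea in base R. This is not the code used in the study: the helper noise_once() is hypothetical, and it samples positions across the whole padded row rather than only the first 200 positions.
# a compact sketch of the same noising step (illustrative only)
noise_once <- function(row_vec, alpha.p){
  samp <- sample(seq_along(row_vec), round(length(row_vec) * alpha.p))
  row_vec[samp] <- 0
  row_vec
}
aug.matrix.alt <- do.call(rbind, lapply(rep(1:nrow(train.text), each = n.aug),
                                        function(r) noise_once(train.text[r, ], alpha.p)))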
Part 2: Callbacks
The goal of the network was to predict students’ performance on an assessment question from the text of that question. One issue I kept running into was a strong local minimum in which the network simply used the type of assessment item (i.e., was it an exam or a quiz question?) and the basic vocabulary to predict difficulty. In order to learn a more complex representation, the loss probably needed to go up before it could come back down; this is related to the deep double descent phenomenon.
One solution is simply to let the network train for a long time, but my data set was limited and I did not yet have a GPU. Instead, I created a callback function to cyclically modify the learning rate, giving the network “opportunities” to explore outside the local minimum. The learning rate still decreases slowly over time.
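Concretely, the schedule defined below is lr(x) = (sin(x/20) + 1) * exp(-x/5000) / 100: it oscillates between 0 and roughly 0.02 with a period of about 2*pi*20 (around 126 batches), while the exponential term slowly damps the whole curve toward zero over a few thousand batches.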
# cyclical learning-rate schedule: a sine wave damped by a slow exponential decay
clr <- function(x) (sin(x/20)+1)*exp(-x/5000)/100
# preview the learning-rate schedule over many training iterations
plot(clr(1:(80*(984/16))), cex=0.1, xlab="iteration", ylab="learning rate")

# reset the batch counter and learning-rate history at the start of training
callback_lr_init <- function(x){
  iter <<- 0
  lr_hist <<- c()
}
# before each batch, set the optimizer's learning rate from the schedule
callback_lr_set <- function(batch, logs){
  iter <<- iter + 1
  learning_r <- clr(iter)
  k_set_value(youtube.assignment.model$optimizer$lr, learning_r)
}
# record the learning rate actually used for each batch
callback_lr_log <- function(batch, logs){
  lr_hist <<- c(lr_hist, k_get_value(youtube.assignment.model$optimizer$lr))
}

callback_lr <- callback_lambda(on_train_begin = callback_lr_init, on_batch_begin = callback_lr_set)
callback_logger <- callback_lambda(on_batch_begin = callback_lr_log)
# R6 callback that stores the per-batch loss and accuracy (when reported) for later inspection
LogMetrics <- R6::R6Class("LogMetrics",
  inherit = KerasCallback,
  public = list(
    loss = NULL,
    acc = NULL,
    on_batch_end = function(batch, logs = list()) {
      self$loss <- c(self$loss, logs[["loss"]])
      self$acc <- c(self$acc, logs[["acc"]])
    }
  ))

# initializing "LogMetrics" object
callback_log_acc <- LogMetrics$new()
Part 3: Network Architecture
This is the final architecture for the network. It takes two inputs: (a) a one-hot-encoded vector of assessment categories and (b) the text of the assessment item. Because of this, I had to use the functional Keras API rather than the sequential model. I discuss several aspects of this model design in the paper.
# specify network architecture

# input (a): one-hot-encoded assessment category, with heavy dropout
assignment_input <- layer_input(shape = c(7), name = "asgn1")
assignment_drop <- assignment_input %>%
  layer_dropout(rate = 0.90)

# input (b): tokenized question text passed through a pretrained embedding
text_input <- layer_input(shape = list(NULL), dtype = "int32", name = "text")
class_embedding <- text_input %>%
  layer_embedding(input_dim = max_words, ncol(embedding_matrix),
                  input_length = maxlen, name = "questionemb")

# bidirectional LSTM over the embedded question text
lstm <- class_embedding %>%
  bidirectional(layer_lstm(units = 32, dropout = .4, recurrent_dropout = .4))

# merge the two branches and predict the score with a small dense head
concatenated <- layer_concatenate(list(lstm, assignment_drop))
dense <- concatenated %>%
  layer_dense(units = 24, activation = "relu")
out_layer <- dense %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1, activation = "relu", name = "out")

youtube.assignment.model <- keras_model(list(text_input, assignment_input), out_layer)
summary(youtube.assignment.model)

# load the pretrained embedding matrix into the embedding layer and freeze it
# (freezing takes effect when the model is compiled below)
get_layer(youtube.assignment.model, index = 2) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()
# mean squared error is used as both the loss and the reported metric
youtube.assignment.model %>% compile(
  optimizer = "rmsprop",
  loss = "mean_squared_error",
  metrics = c("mean_squared_error")
)

### Run youtube.assignment.model
history <- youtube.assignment.model %>% keras::fit(
  list(as.matrix(aug.matrix), as.matrix(aug.asgn)),   # inputs: question tokens + assessment categories
  as.matrix(aug.out),                                 # outcome: assessment scores
  epochs = 40,
  batch_size = 64,
  shuffle = TRUE,
  callbacks = list(callback_lr, callback_logger, callback_log_acc),
  validation_data = list(list(valX1, valX2), valY)
)
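Once training finishes, the objects collected by the callbacks above can be inspected directly. A minimal sketch (assuming the run above has completed) is:
# learning rate actually applied on each batch (logged by callback_logger)
plot(lr_hist, type = "l", xlab = "batch", ylab = "learning rate")
# per-batch training loss collected by the LogMetrics callback
plot(callback_log_acc$loss, type = "l", xlab = "batch", ylab = "training loss")
# epoch-level training and validation curves returned by keras::fit()
plot(history)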