Process Mining Student Discourse

Student Epistemics

Introduction

I’ve included this code in my portfolio not because it’s the most complicated, but because it shows some very basic principles of coding: functions, efficient looping, and data manipulation. There’s very little tidyverse here, just base R. It’s easy to write this code badly.

The underlying problem is turning a list of items into a frequency matrix based on complex conditions. The data is student chat from a computer-supported collaborative learning environment (i.e., an educational video game). We were attempting to use process mining to understand the order of types of speech (e.g., questions, assertions of fact, social organization). To do this, we wanted frequency counts of pairs and triplets of speech types, such as a question, followed by a statement of fact, followed by another question.
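
As a small illustration of the counting problem, the sketch below tallies adjacent pairs in a made-up vector of codes. The vector `toy_codes` and its contents are invented for this example, and `table()` is used here only as a compact stand-in; it is not the loop used later in this document.

```r
# Hypothetical toy data: six coded chat lines, in order
toy_codes <- c("K-", "Reply", "K-", "Reply", "K+", "Reply")

# Pair each line with the line that follows it
first  <- head(toy_codes, -1)   # lines 1 to 5
second <- tail(toy_codes, -1)   # lines 2 to 6

# Cross-tabulate: rows are the first code, columns are the second
pair_counts <- table(first, second)
pair_counts
```

Six lines give five adjacent pairs, and in this toy vector the pair K- followed by Reply occurs twice.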

I’m very proud of (a) the speed at which this code ran when applied to 10,000 lines of student chat, and (b) figuring out a simple way to visualize the data that allowed us to make interesting inferences. As of February 2022, we are writing this up as a journal article.

Definitions

The definitions of the different codes may help with understanding the output:

K-. A question or other query for information. Often in speech our questions are implicit, not explicit.
K+. A direct knowledge claim or assertion.
Reply. A reply to a previous comment. Would be meaningless outside of context.
Reply-Knowledge. A reply that includes new knowledge.
Reply-Hedge. A reply that “hedges” what is being said (e.g., “I’m not sure, but…”).
Social Organization. Attempts to organize, from introductions like “hello” to more explicit organization like “what should our next step be?”
Other. Anything else. Often spam or off-topic.

Code

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(RColorBrewer)

#Support constant sets the cutoff for patterns in student speech, i.e., if less than this proportion of speech patterns fall into a category, we don't carry them forward.
support<-.05

Custom Functions

#Custom Function to calculate marginal percentages for rows or columns of a matrix
margin_prop<-function(x){
  s<-sum(x)
  x/s
}

# Custom function used to select locations in the data where a certain sequence of 2 or 3 codes is present.
# A match that starts at position i is reported as i+1 (see the return value).
epis_pattern<-function(vector,pattern1,pattern2,pattern3=NA){
  locations<-c()
  if(is.na(pattern3)){
    for(i in 1:(length(vector)-1)){
      if(vector[i]==pattern1 & vector[i+1]==pattern2){
        locations<-c(locations,i)
      }
    }
  }else{
    for(i in 1:(length(vector)-2)){
      if(vector[i]==pattern1 & vector[i+1]==pattern2 & vector[i+2]==pattern3){
        locations<-c(locations,i)
      }
    }
  }
  return(locations+1)
}
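
For comparison, the same search can be written without an explicit loop. The sketch below is a hypothetical variant, not the function used in this analysis: the name `epis_pattern_vec` and the vector `toy` are invented for illustration. It keeps the convention of reporting a match that starts at position i as i+1.

```r
# Hypothetical vectorized variant of epis_pattern(); not the author's function
epis_pattern_vec <- function(vector, pattern1, pattern2, pattern3 = NA) {
  n <- length(vector)
  k <- if (is.na(pattern3)) 2 else 3      # pattern length
  if (n < k) return(integer(0))           # too short to contain the pattern
  starts <- seq_len(n - k + 1)            # candidate starting positions
  hit <- vector[starts] == pattern1 & vector[starts + 1] == pattern2
  if (k == 3) hit <- hit & vector[starts + 2] == pattern3
  which(hit) + 1                          # same convention: start position + 1
}

# Invented toy data
toy <- c("K-", "Reply", "K-", "K-", "Reply", "K-", "K-")
pairs_found    <- epis_pattern_vec(toy, "Reply", "K-")
triplets_found <- epis_pattern_vec(toy, "Reply", "K-", "K-")
```

In this toy vector both calls find matches starting at positions 2 and 5, so both return 3 and 6. One difference worth knowing: `which()` silently skips `NA` comparisons, whereas an `if()` inside a loop would stop with an error on a missing code.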

Load and Clean

  codes<-read.csv("EcoJourney_DiscourseData_All_v2.csv")
  codes<-codes[,1:7]
  names(codes)[1]<-"GroupID"
  
#Split out the epistemic column. This is just for convenience.
  epis<-codes$Epistemics

#Flag rows posted by "wizard" users (any UserID containing "w"; the plots below label these as the instructor).
  Wizard.indecies<-grepl("w",codes$UserID,ignore.case = T)
  Wizard.indecies.IDS<-which(Wizard.indecies)

Analysis 1: Epistemic Pairs

###Create epistemic pairs matrix
  freq1<-matrix(0,nrow=length(unique(epis)),ncol=length(unique(epis)))
  wizard_freq1<-matrix(0,nrow=length(unique(epis)),ncol=length(unique(epis)))

###Name rows and columns. We'll use these to select cells in the matrices  
  rownames(freq1)<-unique(epis)
  colnames(freq1)<-unique(epis)
  
  rownames(wizard_freq1)<-unique(epis)
  colnames(wizard_freq1)<-unique(epis)

###Comb through coded text and find couplets of speech types and put frequencies in matrix.   
  for(i in 1:(length(epis)-1)){
    freq1[epis[i],epis[i+1]]<-(freq1[epis[i],epis[i+1]]+1)
    if(i %in% Wizard.indecies.IDS){
      wizard_freq1[epis[i],epis[i+1]]<-(wizard_freq1[epis[i],epis[i+1]]+1)
    }
  }

###Transform counts into row proportions (each row sums to 1).
  freq1_prob<-t(apply(freq1,MARGIN=1,margin_prop))
  rownames(freq1_prob)<-rownames(freq1)
  matrix_rows = sum(freq1_prob > support)
  support1<-matrix(0,nrow = matrix_rows,ncol=2)

###Select all supported discourse pairs to carry forward into triplet analysis.
  k = 1
  for(i in 1:nrow(freq1_prob)){
    for(j in 1:ncol(freq1_prob)){
      if(freq1_prob[i,j]>support){
        support1[k,]<-c(rownames(freq1)[i],colnames(freq1)[j])
        k = k + 1
      }
    }
  }
  
freq1

##                     Other Social organization  K-  K+ Reply Reply-Hedge
## Other                1103                  90 146 151   213          23
## Social organization    60                 156 104  92   144          24
## K-                    133                  47 254 147   581         131
## K+                    127                  65 208 310   383          39
## Reply                 259                 184 534 393  1440         123
## Reply-Hedge            29                  18  91  35   158          81
## Reply-Knowledge        47                  38 167  52   156          52
##                     Reply-Knowledge
## Other                            32
## Social organization              18
## K-                              211
## K+                               48
## Reply                           142
## Reply-Hedge                      61
## Reply-Knowledge                 214
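
The row-proportion and support-filtering steps can also be sketched with base R helpers. The example below is an illustration under stated assumptions, not the code that produced the output above: the matrix `toy_freq`, the cutoff `toy_support`, and the other `toy_` names are all invented. It uses `prop.table()` for row proportions and `which(..., arr.ind = TRUE)` to locate the cells above the cutoff.

```r
# Hypothetical 3x3 count matrix, invented for illustration
toy_freq <- matrix(c(90,  8,  2,
                     10, 30, 60,
                      3, 47, 50),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

toy_support <- 0.05   # plays the same role as `support`

# Row proportions: each row sums to 1
toy_prob <- prop.table(toy_freq, margin = 1)

# Row and column positions of every cell above the cutoff
keep <- which(toy_prob > toy_support, arr.ind = TRUE)
# Sort row by row so the order matches a nested row/column loop
keep <- keep[order(keep[, "row"], keep[, "col"]), , drop = FALSE]

# Two-column character matrix of (first code, second code)
toy_pairs <- cbind(rownames(toy_prob)[keep[, "row"]],
                   colnames(toy_prob)[keep[, "col"]])
toy_pairs
```

With these invented counts, seven of the nine pairs clear the 5% cutoff; A followed by C (2%) and C followed by A (3%) are dropped.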

Analysis 2: Epistemic Triplets

###Initiate matrices
  freq2<-matrix(0,nrow=nrow(support1),ncol=nrow(freq1))
  wizard_freq2<-matrix(0,nrow=nrow(support1),ncol=nrow(freq1))
  
###name rows and columns 
  rownames(freq2)<-paste(support1[,1],support1[,2],sep = "->")
  colnames(freq2)<-colnames(freq1)
  
  rownames(wizard_freq2)<-paste(support1[,1],support1[,2],sep = "->")
  colnames(wizard_freq2)<-colnames(freq1)
  
###Grab Epistemic Triplets. Essentially same as for couplets.
  for(i in 1:(length(epis)-2)){
    pattern = paste(epis[i],epis[i+1],sep = "->")
    if(pattern %in% rownames(freq2)){
      freq2[pattern,epis[i+2]]<-freq2[pattern,epis[i+2]]+1
      if(i %in% Wizard.indecies.IDS){ 
        wizard_freq2[pattern,epis[i+2]]<- wizard_freq2[pattern,epis[i+2]]+1
      }
    }
  }  
  
###Save percentage frequencies by row
  freq2_prob<-as.data.frame(t(apply(freq2,MARGIN=1,margin_prop)))
  colnames(freq2_prob)<-colnames(freq2)
head(freq2)

##                            Other Social organization K- K+ Reply Reply-Hedge
## Other->Other                 828                  47 61 83    69           8
## Other->Social organization    18                  30 10 15    15           2
## Other->K-                     37                   3 27 13    43           6
## Other->K+                     46                   8 23 36    32           3
## Other->Reply                  51                   5 30 22    90           7
## Social organization->Other    31                   8  9  6     5           0
##                            Reply-Knowledge
## Other->Other                             5
## Other->Social organization               0
## Other->K-                               17
## Other->K+                                3
## Other->Reply                             8
## Social organization->Other               1
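
To make the triplet layout concrete, here is a minimal sketch on invented data. The vector `toy_codes` is made up, and `table()` stands in for the loop; unlike the analysis above, it counts every triplet and applies no support filter. Rows are labeled by the first two codes, joined with "->", and columns by the third code.

```r
# Hypothetical toy data: seven coded chat lines, in order
toy_codes <- c("K-", "Reply", "K-", "Reply", "K+", "Reply", "K-")
n <- length(toy_codes)

# A sequence of n lines contains n - 2 windows of three consecutive lines
first_two <- paste(toy_codes[1:(n - 2)], toy_codes[2:(n - 1)], sep = "->")
third     <- toy_codes[3:n]

# Rows: first two codes of each window; columns: the third code
triplet_counts <- table(first_two, third)
triplet_counts
```

Seven lines give five windows. In this toy vector the pair K- followed by Reply is followed once by K- and once by K+.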

Results

Unpacking these results is the topic of the paper we are currently writing, and doing so here would make this document a bit unruly. See the “Student Discourse Results” PowerPoint in my public GitHub repo if you’re interested in a quick overview.

###Full Results      
  heatmap(freq1,
          main="Chat Pair Frequencies",
          margins = c(12,12),
          ylab="First Chat",
          xlab="Second Chat")

  heatmap(as.matrix(freq2_prob),
          margins=c(12,12),
          main="Chat Triplet Frequencies",
          ylab="First & Second Chat",
          xlab="Third Chat")

###Wizard Results
  heatmap(as.matrix(wizard_freq1),
          main="Chat Pairs Initiated by Instructor",
          margins=c(12,12),
          ylab="First Chat",
          xlab="Second Chat")

  heatmap(as.matrix(wizard_freq2),
          main="Chat triplets Initiated by Instructor",
          xlab="Third Chat",
          margins=c(12,12))

####Select Lines for qualitative review
  lines<-epis_pattern(epis,"Reply-Knowledge","K-","K-")
  lines

##  [1]  247 1522 1993 2309 2412 4980 5226 5663 5677 5685 5732 6764 7496 7515 7528
## [16] 7694 8389 8846 8921

Tanner Phillips, Ph.D. Candidate