Evaluating Artificial Intelligence (AI) marking tools for IELTS writing and speaking papers: Reliability and functionality

Proceedings of the 2nd International Education Conference

Year: 2024


Yuet Wai Wong


ABSTRACT:

With the recent advancement of Artificial Intelligence (AI), language assessment tools have improved dramatically. This study evaluated the reliability and functions of two marking tools for IELTS writing and speaking papers, Writing9 and Smalltalk2, which had been used by university students. To check whether the tools could provide fair and objective assessment, the generated scores were compared with human scores with reference to the IELTS marking descriptors. Statistical analysis was conducted to determine Inter-Rater Reliability (IRR), including Cohen's Kappa, the Intraclass Correlation Coefficient (ICC), mean difference, standard deviation, variance and a Bland-Altman plot. It was found that the two marking tools showed moderate agreement with the scores assigned by human raters, as they had limitations in marking the papers holistically and in taking other linguistic elements into consideration. Additionally, to explore the types of functions provided by the tools, the AI-generated feedback was studied. It was discovered that the AI-generated quantitative and qualitative feedback was diagnostic in nature and could draw students' attention to their strengths and weaknesses in language knowledge and skills. This was regarded as a very useful function for assessment for language learning. It was concluded that the two instruments were practical. However, the design and model training of the tools should be further improved and tested, and further research should be conducted to assess the validity of the scores.
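Two of the IRR measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis: the band scores below are hypothetical, and the helper names are the author's own invention. Cohen's Kappa treats the half-band scores as nominal categories, and the mean difference and standard deviation of the paired differences are the quantities a Bland-Altman plot is built from.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters scoring the same items (nominal categories)."""
    n = len(rater_a)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

def mean_diff_sd(scores_a, scores_b):
    """Mean and sample SD of paired differences (the basis of a Bland-Altman plot)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1)
    return mean, var ** 0.5

# Hypothetical IELTS band scores for ten scripts (illustrative only)
ai    = [6.0, 6.5, 7.0, 5.5, 6.0, 7.5, 6.5, 6.0, 5.5, 7.0]
human = [6.0, 6.0, 7.0, 5.5, 6.5, 7.0, 6.5, 6.0, 6.0, 7.0]

print(cohens_kappa(ai, human))   # ≈ 0.47, conventionally "moderate agreement"
print(mean_diff_sd(ai, human))
```

A Kappa in the 0.41–0.60 range is conventionally read as moderate agreement, which is the level of AI–human agreement the abstract reports.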

Keywords: AI-generated feedback; assessment for learning; IELTS marking descriptors; Inter-Rater Reliability (IRR); language assessment tools