AI grading to the rescue? Teachers may finally get a break

New University of Georgia research explores how language models compare to human educators when evaluating student work
For teachers drowning in stacks of grading, artificial intelligence may offer a glimmer of hope. New research from the University of Georgia shows that large language models (LLMs) can help streamline the grading process, although the technology still cannot fully replace human judgment. The study, published in Technology, Knowledge and Learning, reveals both the potential and the limitations of using AI tools to evaluate complex student assignments.
The study comes at a critical moment, as many educators face growing pressure to adopt more interactive learning methods while still providing timely feedback to students.
Challenges of modern science education
Science teachers face a particular challenge in adopting the Next Generation Science Standards, which emphasize student argumentation and investigation rather than simple memorization. These complex tasks create a significant grading burden.
“Asking students to draw models, write explanations, and argue with one another is very complex,” said Xiaoming Zhai, the study’s corresponding author, associate professor, and director of the AI4STEM Education Center at UGA’s Mary Frances Early College of Education. “Teachers often don’t have enough time to grade all of their students’ responses, which means students can’t receive feedback in time.”
This feedback bottleneck can greatly affect students’ learning progress, especially when timely guidance is crucial to the development of scientific thinking skills.
How AI handles the grading process
The study examined how the large language model Mixtral evaluated middle school students’ written responses to science questions. In one example, students were asked to create a model showing what happens to particles during heat transfer. The researchers wanted to see how the AI’s grading process compared with human approaches.
Although the AI could generate scoring rubrics and assign grades almost instantly, the researchers found fundamental differences in how the technology evaluated work compared to human teachers.
Key findings about AI grading:
- The LLM graded responses far faster than human teachers
- The AI often relied on spotting keywords rather than evaluating genuine understanding
- Without human-created rubrics, the AI scored with only 33.5% accuracy
- With access to human-created rubrics, accuracy rose to just over 50%
- The AI tended to overestimate student understanding based on limited evidence
The research team found that the AI grader often took shortcuts during assessment, looking for specific keywords instead of evaluating the underlying logic and reasoning in student answers.
“Students may mention that the temperature rises, and the large language model interprets that to mean the students understand that particles move faster when temperature increases,” Zhai said. “But based on the students’ writing, we as humans can’t infer whether the students actually know that the particles will move faster.”
Improving AI grading accuracy
Despite these limitations, the researchers see potential for improvement. Providing the AI with detailed, human-created rubrics outlining specific evaluation criteria significantly improved the technology’s accuracy. Such rubrics help the AI approach the deeper analytical thinking that human graders use when evaluating students’ work.
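To make the idea concrete, here is a minimal sketch of how a human-written rubric might be folded into an LLM grading prompt. It is purely illustrative: the rubric wording, scoring levels, and the grade_response helper are assumptions for demonstration, not the researchers’ actual setup, and the generic call_llm parameter stands in for whatever model interface is used.

```python
# Illustrative sketch only: shows how a human-written rubric could be included
# in an LLM grading prompt. The rubric text, score levels, and helper names
# below are hypothetical, not the study's actual materials.

RUBRIC = """
Score 3: Model shows particles moving faster AND spreading apart as heat is added.
Score 2: Model shows particles moving faster OR spreading apart, but not both.
Score 1: Model mentions heat or temperature change without describing particle motion.
Score 0: No relevant scientific idea is expressed.
"""

def build_grading_prompt(student_response: str) -> str:
    """Combine the rubric and a student's written response into one grading prompt."""
    return (
        "You are grading a middle school science response about heat transfer.\n"
        "Use ONLY the rubric below. State which rubric level the evidence in the\n"
        "response actually supports, then give the score.\n\n"
        f"Rubric:\n{RUBRIC}\n"
        f"Student response:\n{student_response}\n\n"
        "Reasoning and score:"
    )

def grade_response(student_response: str, call_llm) -> str:
    """Send the rubric-guided prompt to an LLM client supplied by the caller."""
    return call_llm(build_grading_prompt(student_response))

if __name__ == "__main__":
    example = "When the metal gets hotter, the temperature goes up."
    # A lambda stands in for a real LLM call so the sketch runs as-is.
    print(grade_response(example, call_llm=lambda prompt: prompt))
```

The design choice the sketch highlights is the one the study points to: asking the model to tie its score to explicit rubric evidence, rather than letting it latch onto keywords in the student's answer.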
Could AI ultimately replace human graders entirely? The researchers caution against that approach, suggesting instead that AI may best serve as an assistant to human teachers rather than a replacement.
“The train has left the station, but it has only just left the station,” Zhai said. “This means we still have a long way to go when it comes to using AI, and we still need to figure out which direction to take.”
Real-world impact on educators
Despite the current limitations, teachers involved in related research expressed enthusiasm about the potential time savings of AI grading tools.
“Many teachers told me, ‘I have to spend my weekend giving feedback, but by using automatic scoring, I don’t have to do that. Now I have more time to focus on more meaningful work instead of labor-intensive work,’” Zhai said. “That is very encouraging to me.”
As AI technologies continue to evolve, these tools may become increasingly valuable for educators seeking to balance comprehensive assessment with a manageable workload. The key seems to be finding the right partnership between human judgment and technological efficiency, one that allows teachers to focus their expertise where it matters most while leveraging AI to handle the routine aspects of assessment.