© 2024 Blaze Media LLC. All rights reserved.
Blaze News investigates: Texas prepares to cut costs by using AI to grade students' state-required exams

Students sitting for the STAAR exam throughout the state of Texas will have their written answers automatically graded by artificial intelligence. The development comes as some fear AI could significantly affect education in the near future.

The exam — provided by the Texas Education Agency — will use an "automated scoring engine" to help grade open-ended questions on the State of Texas Assessment of Academic Readiness for writing, reading, science, and social studies. The move is a step beyond using machines to simply grade multiple-choice answers or Scantron sheets.

The state reportedly expects to save $15-$20 million per year by significantly reducing the number of human graders. The plan for 2024 is to hire just 2,000 graders, a steep cut from the 6,000 hired in 2023.

The technology that will grade the long-form answers uses natural language processing, one of the building blocks of well-known artificial intelligence chatbots such as GPT-4.

Blaze News reached out to John Symons — professor of philosophy and director of the Center for Cyber Social Dynamics at the University of Kansas — who appeared to be optimistic about AI's ability to effectively grade STAAR exams without any significant setbacks.

"We're basically looking for thresholds, how can the kids hit those thresholds? And I think, you know, contemporary LLM [large language models] will be able to catch most of those cases and do just fine."

Since the STAAR exam was restructured in 2023, it now features fewer multiple-choice questions and more open-ended questions. These open-ended questions are also known as constructed response items.

Jose Rios, the director of student assessment at the Texas Education Agency, said, "We wanted to keep as many constructed open-ended responses as we can, but they take an incredible amount of time to score."

Before the AI could accurately grade students' exams, the Texas Education Agency had to develop a scoring system. To build it, the agency gathered 3,000 responses, each of which underwent two rounds of human grading.

After the responses were graded by humans, the automated scoring engine learned the characteristics of responses at each score point, allowing it to assign the same scores a human would have given.
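The training process described above can be sketched in miniature. The feature extraction and model below are hypothetical stand-ins — the agency has not published its engine's internals — but they illustrate the core idea: the engine is fit to reproduce scores that humans already assigned.

```python
from collections import Counter
import math

def featurize(text):
    """Toy stand-in for real NLP features: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyScoringEngine:
    """Learns to echo human scores: predicts the score of the most
    similar human-graded training response (1-nearest-neighbor)."""
    def __init__(self):
        self.examples = []  # (feature_vector, human_score) pairs

    def train(self, graded_responses):
        for text, human_score in graded_responses:
            self.examples.append((featurize(text), human_score))

    def score(self, text):
        vec = featurize(text)
        sims = [(cosine(vec, ex), s) for ex, s in self.examples]
        confidence, predicted = max(sims)
        return predicted, confidence  # low confidence -> human review

# In Texas' case the training set was ~3,000 twice-graded responses;
# here, a toy set of two.
engine = ToyScoringEngine()
engine.train([("the water cycle moves water through evaporation", 4),
              ("idk", 0)])
```

A real engine would use far richer features and a proper statistical model, but the contract is the same: new responses are scored by their resemblance to responses humans have already judged.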

When asked about the potential for AI to make mistakes, Symons noted that the technology is sophisticated enough to place answers much as humans locate a specific shade on a color wheel: between red and blue lie tens, possibly hundreds, of intermediate shades.

"So if you think about something like a color wheel, right, and you think about the location of a color in a color wheel, you'd say this particular shade of red has a particular coordinate in this color wheel, right? You could represent that with two numbers," Symons said.

"Now imagine making that three-dimensional, right? And so you could say okay, not just the color but the shade of the color, or the brightness of the color, and you could add other dimensions like the warmth of the color. And eventually, you'd have a multi-dimensional space of color."

"And what these [AI] systems do is they can do that with concepts."

Despite the AI's sophistication, it may still report "low confidence" in a score it has assigned. If that happens, the score will be double-checked by humans. The same is true if the AI encounters a response its programming does not recognize, such as an answer that uses slang or non-English words.
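The safeguard described here amounts to a routing rule: send a score to a human whenever the engine's confidence is low or the response falls outside what it recognizes. A sketch under those assumptions — the threshold and the toy vocabulary are illustrative, not the agency's actual values:

```python
CONFIDENCE_THRESHOLD = 0.7  # illustrative; the real cutoff is unpublished

# Toy stand-in for the engine's recognized vocabulary.
RECOGNIZED_WORDS = {"the", "water", "cycle", "evaporation", "rain"}

def needs_human_review(response, confidence):
    """Return True when a human must double-check the AI's score."""
    # Case 1: the engine reports low confidence in its own score.
    if confidence < CONFIDENCE_THRESHOLD:
        return True
    # Case 2: the response contains words the engine does not
    # recognize, e.g. slang or non-English terms.
    words = set(response.lower().split())
    if not words <= RECOGNIZED_WORDS:
        return True
    return False
```

For example, a confident score on a fully recognized response would pass straight through, while a response containing "lluvia" would be flagged regardless of confidence.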

Chris Rozunick — the division director for assessment development at the Texas Education Agency — said that the agency has "always had very robust quality control processes with humans," and that the same is true for technologies that will replace humans in the grading process.

In addition to humans reviewing low-confidence scores, a random sample of student responses will automatically be routed to humans to double-check that the technology is on the right track.
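This random audit is straightforward to sketch: draw a random subset of already-scored responses and queue it for human re-checking. The 5% audit rate below is illustrative; the agency has not published its sampling rate.

```python
import random

def audit_sample(scored_responses, fraction=0.05, seed=None):
    """Pick a random subset of AI-scored responses for human re-grading."""
    rng = random.Random(seed)  # seed only for reproducible demos
    k = max(1, int(len(scored_responses) * fraction))
    return rng.sample(scored_responses, k)

# 100 scored responses -> 5 randomly chosen for a human double-check.
queue = audit_sample([f"response-{i}" for i in range(100)],
                     fraction=0.05, seed=42)
```

Because the sample is random rather than triggered by low confidence, it can catch systematic drift that the engine itself would never flag.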

Software engineer and technologist Mike Wacker told Blaze News that it is imperative AI not suddenly "replace humans ... without evidence that the AI performs comparably to humans. You can't trust a rating just because it was produced by a fancy AI."

However, the Texas Education Agency has pushed back against characterizing the system as AI grading the exams. The agency acknowledges that the automated graders use technology similar to GPT-4 and Google's Gemini but stresses that humans will retain significant oversight of the process.

Rozunick said that "we are way far away from anything that's autonomous or can think on its own."

Even though the agency has insisted that the technology is a reliable grading mechanism, educators across the state have expressed their pessimism about the prospect of a machine grading a child's work.

Kevin Brown, who is the executive director for the Texas Association of School Administrators, said that "there ought to be some consensus about, hey, this is a good thing, or not a good thing, a fair thing or not a fair thing."

Carrie Griffith, a policy specialist for the Texas State Teachers Association, appeared to echo Brown's sentiment, saying that "there's always this sort of feeling that everything happens to students and to schools and to teachers and not for them or with them."

The STAAR exam results are a pivotal part of the accountability system the Texas Education Agency uses to measure learning in school districts and on individual campuses across the state.

Each district and individual campus is graded on an A-F scale. If a district or campus underperforms on the exam, the Texas education commissioner is obliged to intervene.

The stakes are high for campuses and districts across the state, which is partly why some feel uneasy about leaving the students' exams in the hands of sophisticated technology.

It remains to be seen how the technology will handle grading the state's required exams, and what unforeseen consequences may follow.
