ChatGPT and assessment
ChatGPT (and similar generative AI technologies) has created both challenges and opportunities for higher education. This page aims to give you a primer on how these technologies work, their strengths and weaknesses, and how to view their use within assessment.
What is ChatGPT?
ChatGPT is one of the first ‘Large Language Model’ (LLM) artificial intelligence tools that is both freely available and user-friendly. Although several other models and tools are now available, ChatGPT, because of its high usage and notoriety, is the focus of this article. LLM AI tools scrape the internet for texts to learn from, and build a model of words and the relationships between them. The model thereby gains the ability to predict the next word within a given context, a little like the predictive texting on your phone, but with the advantage of being applicable to very specific contexts.
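The “predict the next word” idea can be illustrated with a toy sketch. The code below is a deliberately simplistic bigram model over a tiny made-up corpus — real LLMs use neural networks trained on billions of documents, so this is only an analogy for the statistical principle, not how ChatGPT actually works:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, already split into word tokens.
corpus = (
    "the patient has a fever . the patient has a cough . "
    "the doctor examines the patient ."
).split()

# Count bigram frequencies: for each word, which words follow it, and how often.
next_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    next_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the corpus."""
    followers = next_counts[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))      # "patient" — it follows "the" most often
print(predict_next("patient"))  # "has"
```

Even this toy model produces locally plausible continuations without any understanding of patients or doctors, which is the key point: fluency comes from statistics over the training text, not from comprehension.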
Unlike other AI-driven text services, such as Microsoft Editor or Grammarly, which students (and educators!) may use to refine their own texts, ChatGPT can generate a text from a given prompt within a given context. The user can then accept the output or refine the prompt to obtain a better answer.
In common with all AI services, Large Language Models operate in a so-called black box, where we cannot easily see the relationship between the input and the output.
Whilst ChatGPT (and other LLM services) can produce texts which seem very human, it is worth remembering that the algorithm has absolutely no understanding of the material (1,2); its output is simply statistically plausible given its input parameters. It is also worth keeping in mind that although the AI has clearly been trained on masses of text, it has gaps in its knowledge that it has no ability to recognize (2,3). That said, the shift from the GPT-3.5 to the GPT-4 model appears to have improved accuracy considerably, and this is likely to improve further in the future.
Limitations of Generative LLM AI services
- The output is statistically viable in any given situation, but in practice that means it is also normative of its training text, which mirrors the norms of the texts found on the internet (1). Biases, both subtle and overt, towards western attitudes and values can easily be seen, though the developers of some systems such as ChatGPT have clearly implemented safeguards to limit explicit sexism, racism, and other societally unacceptable biases, both through the programming and through pre-screening the material the system learns from (4).
- Similarly, given certain prompts, the service will tend towards the extreme (for example, when asked to describe a day in the life of a patient with a specified illness, the patient may exhibit every conceivable symptom).
- Since the AI does not understand, it can also sometimes confidently write text which is wrong (2,5). These may be small factual errors or bigger fabrications, for example making up references to books that don’t exist. Since the model is statistical, it does not state when it is unsure about its statements. This highlights the importance of questioning our own tendency to conflate fluency with accuracy, as highlighted in Martindale and Carpuat’s (2018) study of users’ trust in machine-translated texts (6).
- The AI does not always have access to specific texts (3), which sometimes shows in its answers.
Whilst a number of services now claim to be able to detect AI-generated output, these are often easy to defeat by making small changes to texts (7), and it is unclear how well they will hold up in the future as new LLM AI models emerge and strategies to avoid detection are developed.
The current state of assessment
As you may know, we have different assessment formats. In an essay, students write a response to the question without any cues from the teacher. This form of assessment is one of the most effective ways to check whether the student can construct a complex response to a challenging question (8). It is suitable for home exams, and during the Covid pandemic it was promoted by most universities as an alternative to on-site exams.
Recently, conversational AI has developed to the point where it can generate text based on existing literature from the web. This technology seems likely to help us do our jobs more easily and better, so we need to include its use as a skill in higher education. Students, like other people, can use this tool to answer their exam questions.
There is heated discussion about the impact of this tool on academic exams. As of now, only four formats from our assessment portfolio could be affected by conversational AI (like ChatGPT). A simple solution is to use other alternatives (besides essays) to increase the validity and reliability of the assessment.
This diagram shows the assessment forms which are affected by ChatGPT and other similar services.
The affected forms are all within the category Written Assessment, and include Essay, Short Answer Question, Completion Questions and Dissertation. There are, however, many other forms within this category which are entirely unaffected, including Multiple Choice Questions, Extended Matching Questions, Patient Management Problems and Reports, as well as the entirely unaffected categories Clinical/Practical Assessments, Observations, Portfolios and Other Performance Records, and Peer and Self-Assessment.
However, discussion around ChatGPT raises some questions about the quality of our exams, as Daisy Christodoulou put it in her blog post: “If we are setting assessments that a robot can complete, what does that say about our assessments?” (9)
Teachers’ reaction to ChatGPT
ChatGPT is a new concept and tool, and like other technologies it takes time to adopt. We as teachers may have three types of reaction to ChatGPT (as described by Dr Philippa Hardman) (10):
- Veto: deter students from using AI by punishing them for doing so.
- Bypass: go around the AI problem by returning to in-class exam methods.
- Embrace: use AI as a tool to develop students' digital literacy and analytical skills.
We teach different courses at university, at different levels, so the use/ban approach should be customized to those different levels in academia. To form a better strategy, we may use the Structure of the Observed Learning Outcome (SOLO) taxonomy (11), first described by John Biggs and Kevin Collis in the early 1980s. In the written assessment formats (which are suitable for assessing knowledge), we often use the SOLO taxonomy.
This diagram shows the SOLO taxonomy. SOLO, which stands for the Structure of the Observed Learning Outcome, is a means of classifying learning outcomes in terms of their complexity, enabling us to assess students’ work in terms of its quality, not of how many bits of this and of that they have got right.
In some courses, we need to ensure students know the basics before starting the next level (declarative knowledge) (11). One proposal could be to ask students not to use external resources (books, conversational AI) during the assessment of this form of knowledge. However, to reach the “relational” and “extended abstract” levels, it can help to use some kind of AI so that students first organize their declarative knowledge and then try to make it more functional. Because of the questionable accuracy of the information provided by these tools, however, students should be able to analyze the output and use it with care. Some universities ask students to cite ChatGPT in their references if they use it.
AI and Plagiarism detection industries
Integration of AI into public search engines (like Microsoft’s and Google’s) helps people access knowledge more affordably. In academia, we have encountered the challenge that “AI is making cheating democratic!” Previously, only wealthy students could afford human-generated cheating services!
On the other hand, companies like Turnitin (which recently acquired Urkund) are investing in tools to help teachers detect AI-generated text and protect academic integrity. Scholars at leading technology universities have also built their own detection tools (12). Even OpenAI, which developed ChatGPT, released a tool to help detect whether a text has been written by a human or an AI (https://openai-openai-detector.hf.space/).
Dialogue about AI and assessment
This is a challenge for all of us in academia. Students don't know how to use the tool properly. Teachers don't know how to deal with assignments written with it. Academic policymakers are working on the ethical and legal consequences of promoting it in academia. We might start with teachers: at this stage, we think it is better to start a dialogue among us to understand the situation and help each other before a clear recommended guideline exists.
Tips for teachers
- Some universities have started to brief their teachers about it (13). At the very beginning, it is important for you as a teacher to be familiar with AI tools (ChatGPT, for example) and to test them. For example, you can feed ChatGPT questions similar to your previous exam questions to understand how responses vary. This helps you recognize whether students have used these tools in writing their assignments. It is also recommended not to use your real exam questions, because the machine learns from them and may provide better answers for your students later on!
- If you aim at higher SOLO levels (including data synthesis or analysis), you may use an example from a recent paper or data from a lab experiment to create a more authentic essay. Other approaches include oral examinations, projects, observation of discussion, and many more (8).
- In recent digital exam tools, you can ask for responses in forms other than short text; for example, you can ask students to label parts of a diagram, which may still be hard for AI tools.
- At this moment, the most important difference between us and machines is the ability to reason. You can let your students use ChatGPT, but ask them to argue for their responses, or even to justify whether a ChatGPT answer is correct or not.
- Please ask students to prepare specific bibliographic references, and check them! ChatGPT sometimes creates references which do not exist in the real world. This is called hallucination, and it can be a clue to help your students understand that this tool might harm them if they don't use it critically.
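Checking references can be partly automated when they include DOIs. The sketch below is only an illustrative starting point, not a complete verifier: `looks_like_doi` and `doi_resolves` are hypothetical helper names, and the lookup assumes the public Crossref REST API (`api.crossref.org/works/<doi>`), which returns 404 for unregistered DOIs.

```python
import re
import urllib.error
import urllib.request

# DOIs start with "10.", a numeric registrant prefix, then "/" and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(doi):
    """Cheap offline check: does the string have valid DOI syntax?"""
    return bool(DOI_PATTERN.match(doi))

def doi_resolves(doi, timeout=10):
    """Ask the public Crossref API whether this DOI is actually registered.

    Returns True/False; other network errors propagate to the caller.
    """
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:  # syntactically fine, but no such registered DOI
            return False
        raise

print(looks_like_doi("10.1145/3442188.3445922"))  # True (valid DOI syntax)
print(looks_like_doi("not-a-doi"))                # False
```

A syntactically valid DOI can still be hallucinated, which is exactly why the online lookup matters; and even a resolving DOI may be attached to the wrong authors or title, so a human check of the retrieved metadata remains essential.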
And finally, please think about the future of the workplace. How do we want to empower our students for it? If in the future we need to be able to use such tools on a daily basis, are we preparing our students to integrate them into their tasks?
Tips for students
To start a dialogue with our students (future colleagues), it is important to tell them that you are aware of the availability and rapid advances of text and image generation tools. Again, considering the SOLO taxonomy, you can explain why you are letting them use the tools carefully, or banning them in the examination phase.
Making the red line clear for them will help them understand where using a tool like ChatGPT could be considered dishonest and where it could be productive for them. For example, is generating a framework cheating? An entire essay? What do they think the quality will be like? How might it affect their learning? Students may have different perceptions of academic integrity, and it is good to have an open discussion about it in the classroom.
By showing them a generated response, you can point out that such responses often contain factual errors, and that students need to be able to analyze a text generated by this tool.
Although current detection tools are not 100% reliable, you can tell students whether your university has access to such tools.
Tips for policymakers
Some university presidents have started to reflect on this challenge. It is a developing field and they might change their minds; however, it is good to present how the educational committee at the university thinks about this tool, how it is going to train teachers about it, and how it will treat students who use it.
References
1. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of stochastic parrots: Can language models be too big? In: FAccT 2021 - Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021.
2. ChatGPT General FAQ | OpenAI Help Center [Internet]. [cited 2023 Mar 16]. Available from: https://help.openai.com/en/articles/6783457-chatgpt-general-faq
3. Why doesn’t ChatGPT know about X? | OpenAI Help Center [Internet]. [cited 2023 Mar 16]. Available from: https://help.openai.com/en/articles/6827058-why-doesn-t-chatgpt-know-ab…
4. OpenAI Used Kenyan Workers on Less Than $2 Per Hour: Exclusive | Time [Internet]. [cited 2023 Mar 16]. Available from: https://time.com/6247678/openai-chatgpt-kenya-workers/
5. OpenAI [Internet]. [cited 2023 Mar 16]. Available from: https://openai.com/
6. Martindale MJ, Carpuat M. Fluency over adequacy: A pilot study in measuring user trust in imperfect MT. In: AMTA 2018 - 13th Conference of the Association for Machine Translation in the Americas, Proceedings. 2018.
7. A short experiment in defeating a ChatGPT detector - National centre for AI [Internet]. [cited 2023 Mar 16]. Available from: https://nationalcentreforai.jiscinvolve.org/wp/2023/01/31/a-short-exper…
8. Swanwick T, Forrest K, O’Brien BC, editors. Understanding medical education: Evidence, theory, and practice. 3rd ed. 2018.
9. Christodoulou D. If we are setting assessments that a robot can complete, what does that say about our assessments? The No More Marking Blog [Internet]. 2023 Feb [cited 2023 Mar 15]. Available from: https://blog.nomoremarking.com/if-we-are-setting-assessments-that-a-rob…
10. Hardman P. Veto, Bypass or Flip? [Internet]. [cited 2023 Mar 15]. Available from: https://drphilippahardman.substack.com/p/vetoing-bypassing-and-flipping
11. Biggs JB, Tang C. Teaching for quality learning at university: What the student does. Society for Research into Higher Education; 2011.
12. Human Writer or AI? Scholars Build a Detection Tool [Internet]. [cited 2023 Mar 15]. Available from: https://hai.stanford.edu/news/human-writer-or-ai-scholars-build-detecti…
13. AI, education and assessment: staff briefing #1 | Teaching & Learning - UCL – University College London [Internet]. [cited 2023 Mar 16]. Available from: https://www.ucl.ac.uk/teaching-learning/assessment-resources/ai-educati…