Need help choosing a model and fine-tuning an AI for checking IGCSE papers

I have asked Gemini, ChatGPT and Copilot to help me get started, but the more I ask AI, the more obvious it has become that I need help from real people.

My goal: Fine-tune an AI to help grade IGCSE papers (the first project is English papers). I plan to start with a model that can grade the papers and give a short comment like those provided in the datasets. Once that is working well, I will improve it so it can give more detailed feedback on content and on spelling, punctuation and grammar. My Office 365 subscription means I can use Copilot for OCR, so I do not need to worry about that.

Datasets: Edexcel provides multiple student examples, each with a comment explaining why a certain grade was awarded, so I expect around 150 samples of student answers.

Approach: The AI tools have suggested that I train my model in the cloud and then run it locally on my MacBook Air M2 with 16GB of RAM.

Questions:

  1. Are there any guides I should read or watch before I even get started on this project?
  2. Which model would be the best fit for my project?
  3. Given the complexity of the project, is it OK if I train in the cloud and then run it on my computer?
  4. I assume someone has already done something similar and might have documented their project, but I have not been able to find any posts about a similar project. How would I locate such projects?

Looking forward to your help.


I see you’ve already asked the AI, but just to be sure, here’s a summary from the AI for now.


You can do this with a small dataset, but the winning strategy is usually (1) a strong baseline with strict outputs + evaluation, then (2) LoRA fine-tuning only if the baseline is “close but inconsistent.” That approach keeps you safer on overfitting with ~150 examples, and it fits the “second marker” workflow where a teacher stays the final decision maker.

Also, if you are operating under UK-style regulated-qualification expectations, your design goal should explicitly be recommendation + evidence + human override, not autonomous marking. Ofqual explicitly notes students must not submit AI-produced material as their own, and that centres must not use AI as the sole marker for regulated qualification components. (GOV.UK) JCQ’s guidance similarly emphasizes authenticity and malpractice risk.


1) Guides to read or watch before you start

A. Policy, integrity, and “what your tool must not do”

These are not optional reading if you plan to use this in a real school workflow.

  • Ofqual guide for schools and colleges (2025): clear statements on AI risk, authenticity, and “not the sole marker.” (GOV.UK)
  • JCQ “AI Use in Assessments” (Revision 2025): practical centre/teacher responsibilities, what counts as misuse, and how marking should treat acknowledged AI use.

Why this matters technically: it pushes you toward auditability features (evidence quotes, retrieved exemplars list, rubric-alignment fields, flags) rather than a single opaque “grade.”


B. Automated Essay Scoring (AES) with LLMs: what research says to watch out for

LLMs can score essays, but reliability and consistency (especially near grade boundaries) are recurring issues, and prompt choice can shift outcomes.

Good starting points:

  • A recent LREC-COLING paper evaluating LLMs for AES reports results using Quadratic Weighted Kappa (QWK) and studies prompt effects and consistency.
  • A recent multi-trait scoring approach (“rationale-based” rubric alignment) is relevant to your future “detailed feedback” phase.

Why this matters technically: it tells you what to measure (agreement, stability, boundary confusions) and why you need repeated runs / calibration anchors.


C. Practical engineering guides you will actually use

Structured outputs and schema enforcement

  • Ollama structured outputs: enforce JSON or JSON Schema so your model output is machine-checkable. (Ollama)
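
For example, here is a minimal sketch using the Ollama Python client with a Pydantic schema; the model tag, field names, and band range are illustrative and should be adapted to your actual mark scheme:

```python
# Minimal sketch: grade one answer with a schema-enforced JSON response.
# Model tag, field names, and the band range are illustrative.
from ollama import chat
from pydantic import BaseModel, Field

class GradeResult(BaseModel):
    band: int = Field(ge=0, le=5)       # adapt to your mark scheme
    comment: str                        # 1-3 sentence justification
    evidence_quotes: list[str]          # should be exact substrings of the answer

def grade(rubric: str, task: str, answer: str) -> GradeResult:
    response = chat(
        model="qwen2.5:7b-instruct",    # illustrative tag
        messages=[
            {"role": "system", "content": "You are a second marker. Follow the rubric strictly."},
            {"role": "user", "content": f"Rubric:\n{rubric}\n\nTask:\n{task}\n\nAnswer:\n{answer}"},
        ],
        format=GradeResult.model_json_schema(),   # Ollama enforces this JSON Schema
        options={"temperature": 0.1},
    )
    # Parse and re-validate on your side as well; quote validity is checked
    # later in the evaluation harness.
    return GradeResult.model_validate_json(response.message.content)
```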

Fine-tuning (SFT)

  • Hugging Face TRL SFTTrainer docs: includes completion-only loss (compute loss on the completion only), which is exactly what you want for prompt→JSON grading. (Hugging Face)
  • Known gotcha: completion-only loss can break with certain acceleration settings (example: use_liger_kernel=true dropping the completion mask). This is the kind of thing you want to catch early with a tiny regression test. (GitHub)
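
A training sketch, assuming a recent TRL version where SFTConfig exposes completion_only_loss and a JSONL dataset with prompt/completion columns (paths and hyperparameters are illustrative):

```python
# LoRA SFT sketch with TRL. Expected columns: "prompt" (rubric + task +
# student answer) and "completion" (the target JSON grade record).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="igcse-grader-lora",
        completion_only_loss=True,      # loss on the JSON completion only
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
    ),
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)
trainer.train()
trainer.save_model("igcse-grader-lora")  # with a PEFT config this saves only the adapter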

Tokenization / chat templates (train–inference parity)

  • Transformers chat templating best practice: if you call apply_chat_template(tokenize=False) and tokenize later, you must set add_special_tokens=False to avoid duplicating special tokens. Using tokenize=True is often safer. (Hugging Face)
  • There are also real-world reports of tokenize=True vs “format then encode” behaving differently in some cases, so you should pin a single approach and test it. (GitHub)
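
A quick parity check you can keep in your test suite (model name is illustrative; if the assertion fails for your tokenizer, pick one path and use it everywhere):

```python
# Sketch of the chat-template gotcha: format-then-tokenize must not
# re-add special tokens, or you silently diverge from training.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "system", "content": "You are a second marker."},
    {"role": "user", "content": "Rubric...\nTask...\nStudent answer..."},
]

# Option A (often safer): let the template tokenize directly.
ids_a = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)

# Option B: format to text, then encode WITHOUT re-adding special tokens.
text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
ids_b = tok(text, add_special_tokens=False).input_ids

assert ids_a == ids_b   # parity test: pin one approach and keep this in CI
```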

2) Which model is the best fit?

Given your constraints (MacBook Air M2, 16GB RAM, local inference, short JSON output, and only ~150 labeled examples), you want:

  • 7B–8B instruct model
  • Good instruction-following
  • Strong “structured output / JSON” behavior
  • Widely supported tooling (Transformers, GGUF/Ollama ecosystem, LoRA adapters)

My primary pick to start: Qwen2.5-7B-Instruct

Reasons:

  • The model card explicitly calls out improvements in instruction following and structured outputs (especially JSON), plus long-context support (useful if you later add retrieved exemplars). (Hugging Face)
  • Apache-2.0 license (simpler to deploy in many environments). (Hugging Face)

Strong alternatives to benchmark (you should test 2–3, not just 1)

Mistral-7B-Instruct-v0.3

  • Apache-2.0 license and explicit support for function calling style usage, which often correlates with reliable structured responses. (Hugging Face)

Meta-Llama-3-8B-Instruct

  • Very common baseline with broad community support, but note the custom commercial license and its published context length and other constraints. (Hugging Face)

What “best” means in your project (practical selection criteria)

Run a small benchmark on your frozen test split and pick the model that wins on:

  1. Agreement with human grades (use an ordinal metric like QWK for banded scoring)
  2. Schema validity rate (how often it emits valid JSON that passes your validator)
  3. Evidence-quote validity rate (quotes must be exact substrings)
  4. Stability (same answer across 2–3 low-temperature replications)
  5. Latency on your Mac (teacher workflow tolerance)

On QWK specifically: AES work commonly uses it, and the Kaggle ASAP-style setup is a standard reference point.
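
For instance, scikit-learn's cohen_kappa_score gives you QWK directly (the band values below are made up):

```python
# QWK sketch: quadratic-weighted agreement between human and model bands.
from sklearn.metrics import cohen_kappa_score

human_bands = [3, 4, 2, 5, 3, 1]      # teacher-assigned bands (illustrative)
model_bands = [3, 4, 3, 5, 2, 1]      # model-predicted bands (illustrative)

qwk = cohen_kappa_score(human_bands, model_bands, weights="quadratic")
print(f"QWK: {qwk:.3f}")              # 1.0 = perfect agreement, ~0 = chance level
```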


3) Is it OK to train in the cloud and run locally on your Mac?

Yes. This is a normal pattern, and it is usually the best pattern for your hardware.

Why it works

  • LoRA fine-tuning updates a small number of added parameters while freezing the base model, massively reducing what you need to store and train.
  • QLoRA goes further by enabling efficient fine-tuning while the base is quantized (commonly discussed as 4-bit training). (arXiv)
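
A loading sketch of the QLoRA idea with transformers + peft + bitsandbytes on a cloud GPU (model name and LoRA settings are illustrative; note the Ollama caveat below about preferring non-QLoRA adapters when importing):

```python
# QLoRA-style setup: 4-bit quantized base, trainable LoRA adapter on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # only the small adapter is trainable
```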

How you get the trained result onto your Mac (Ollama path)

Ollama supports importing a fine-tuned Safetensors adapter with a Modelfile that references the base model plus your adapter directory. It also warns:

  • Use the same base model as used to create the adapter, or results can be erratic.
  • Because quantization methods vary, it is “best to use non-quantized (non-QLoRA) adapters” when importing adapters. (Ollama)
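
A minimal Modelfile sketch, assuming the adapter directory saved during training and an illustrative base tag:

```
# Modelfile: FROM must match the base model the adapter was trained on.
FROM qwen2.5:7b-instruct
ADAPTER ./igcse-grader-lora
```

Then something like `ollama create igcse-grader -f Modelfile` followed by `ollama run igcse-grader` lets you smoke-test the combined model on your Mac.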

Local performance reality on a MacBook Air M2 16GB

  • You will almost certainly run quantized weights for the base model to fit memory and get reasonable speed.
  • Ollama explicitly frames quantization as trading some accuracy for much lower memory and better speed, enabling use on more modest hardware, and lists supported quantization levels. (Ollama)
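
For example (the tag is illustrative; check the Ollama library for the exact quantization builds available for your chosen base):

```bash
# Pull an explicitly quantized base build that fits comfortably in 16GB;
# q4_K_M is a common balance of memory, speed, and quality.
ollama pull qwen2.5:7b-instruct-q4_K_M
```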

Net of it: cloud training + local quantized inference is correct. Just be disciplined about base-model version matching, and test the exact on-device stack you will deploy.


4) How to find similar projects and examples

You’re right that “IGCSE grading assistant” is a niche search term. The trick is to search by the research/product category:

A. Use the right keywords (copy/paste searches)

Use combinations like:

  • “automated essay scoring LLM rubric”
  • “AES LLM quadratic weighted kappa”
  • “rubric aligned scoring rationale generation”
  • “prompted essay scoring Llama”
  • “LoRA fine-tune essay scoring JSON”

These queries surface academic baselines plus GitHub repos.

B. Follow citation trails from relevant papers

The LLM-for-AES paper referenced above cites and compares multiple approaches and mentions standard datasets like ASAP.
Once you find one good paper, scroll the related-work section and search those titles.

C. Look for “multi-trait / rationale” scoring (closest to your future roadmap)

Your longer-term plan (detailed content + SPaG feedback) maps well to “trait-based” scoring and rationale extraction, like the rubric-guideline + rationale approach.

D. Use “standard AES dataset” hubs as discovery engines

Even if you never train on them, the ASAP/Kaggle ecosystem is a directory of metrics, repo code, and evaluation conventions.


A concrete fine-tuning plan that fits your data size

If you want something you can execute without guessing:

  1. Baseline first (1–2 days)

    • Prompt = rubric + task + student answer
    • Output = strict JSON with band/score + 1–3 sentence comment + evidence quotes
    • Enforce schema with structured outputs and validate server-side. (Ollama)
  2. Build your evaluation harness (same week)

    • Frozen test split
    • Metrics: QWK or ordinal agreement, boundary confusion counts, schema validity, quote validity
    • Stability: run 2–3 low-temperature replications and flag disagreements (see the harness sketch after this plan)
  3. Only then do LoRA SFT (when baseline is “close but wobbly”)

    • Use TRL SFTTrainer prompt→completion format and enable completion-only loss. (Hugging Face)
    • Add a regression test for the “completion-only loss mask” path to avoid known incompatibilities. (GitHub)
    • Lock your chat template approach and keep it identical at train and inference time. (Hugging Face)
  4. Export adapter and run locally

    • Import adapter in Ollama with FROM base + ADAPTER path.
    • Keep base model identical to training base. (Ollama)
    • Quantize the base for your Mac if needed, then run your smoke tests on-device. (Ollama)
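
A harness sketch for step 2, assuming test items are dicts with rubric/task/answer/human_band fields and a grade_fn like the structured-output grade() helper sketched earlier:

```python
# Evaluation-harness sketch: schema validity, evidence-quote validity, and
# stability across low-temperature replications on the frozen test split.
from collections import Counter

def evaluate(test_items, grade_fn, n_reps=3):
    """test_items: dicts with 'rubric', 'task', 'answer', 'human_band'.
    grade_fn: callable returning an object with .band and .evidence_quotes."""
    stats = Counter()
    for item in test_items:
        bands = []
        for _ in range(n_reps):
            try:
                result = grade_fn(item["rubric"], item["task"], item["answer"])
            except Exception:
                stats["schema_invalid"] += 1          # no valid JSON produced
                continue
            stats["schema_valid"] += 1
            if not all(q in item["answer"] for q in result.evidence_quotes):
                stats["hallucinated_quote"] += 1      # quote is not an exact substring
            bands.append(result.band)
        if len(set(bands)) > 1:
            stats["unstable"] += 1                    # disagrees with itself -> human review
        if bands and max(bands, key=bands.count) != item["human_band"]:
            stats["disagrees_with_human"] += 1        # feed into QWK / boundary analysis
    return stats
```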

Summary (key points)

  • Start with structured JSON outputs + strict validation before training. (Ollama)
  • For your hardware and dataset size, benchmark Qwen2.5-7B-Instruct first, plus 1–2 alternatives. (Hugging Face)
  • Cloud fine-tuning + local quantized inference is the right workflow. Use LoRA adapters and keep base-model versions identical. (Ollama)
  • Use AES research terms (“automated essay scoring”, “QWK”, “rubric”, “multi-trait”) to find similar projects, not “IGCSE” alone.

Thank you so much. There is so much to learn in order to use AI tools in a helpful way, so I think I will start by spending a few days understanding what all of these things mean and looking up what you have suggested. With so many new concepts to understand, I can see that I could easily spend a lot of time only to eventually realize that I went in the wrong direction.

I am not planning to have the AI mark the papers on its own; having it as a tool to assist teachers while they mark should speed up their work.
