I see you’ve already asked the AI, but just to be sure, here’s a consolidated summary for now.
You can do this with a small dataset, but the winning strategy is usually (1) a strong baseline with strict outputs plus evaluation, then (2) LoRA fine-tuning only if the baseline is “close but inconsistent.” That order reduces overfitting risk with ~150 examples, and it fits the “second marker” workflow where a teacher stays the final decision maker.
Also, if you are operating under UK-style regulated-qualification expectations, your design goal should explicitly be recommendation + evidence + human override, not autonomous marking. Ofqual explicitly notes students must not submit AI-produced material as their own, and that centres must not use AI as the sole marker for regulated qualification components. (GOV.UK) JCQ’s guidance similarly emphasizes authenticity and malpractice risk.
1) Guides to read or watch before you start
A. Policy, integrity, and “what your tool must not do”
These are not optional reading if you plan to use this in a real school workflow.
- Ofqual guide for schools and colleges (2025): clear statements on AI risk, authenticity, and “not the sole marker.” (GOV.UK)
- JCQ “AI Use in Assessments” (Revision 2025): practical centre/teacher responsibilities, what counts as misuse, and how marking should treat acknowledged AI use.
Why this matters technically: it pushes you toward auditability features (evidence quotes, retrieved exemplars list, rubric-alignment fields, flags) rather than a single opaque “grade.”
B. Automated Essay Scoring (AES) with LLMs: what research says to watch out for
LLMs can score essays, but reliability and consistency (especially near grade boundaries) are recurring issues, and prompt choice can shift outcomes.
Good starting points:
- A recent LREC-COLING paper evaluating LLMs for AES reports results using Quadratic Weighted Kappa (QWK) and studies prompt effects and consistency.
- A recent multi-trait scoring approach (“rationale-based” rubric alignment) is relevant to your future “detailed feedback” phase.
Why this matters technically: it tells you what to measure (agreement, stability, boundary confusions) and why you need repeated runs / calibration anchors.
C. Practical engineering guides you will actually use
Structured outputs and schema enforcement
- Ollama structured outputs: enforce JSON or JSON Schema so your model output is machine-checkable. (Ollama)
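For instance, a minimal schema-enforced grading call via the Ollama Python client could look like this (the model tag, schema fields, and prompt are illustrative assumptions, not a fixed design):

```python
# A sketch of a schema-constrained grading call with the Ollama Python client.
import json

from ollama import chat

schema = {
    "type": "object",
    "properties": {
        "band": {"type": "integer", "minimum": 1, "maximum": 9},
        "comment": {"type": "string"},
        "evidence_quotes": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["band", "comment", "evidence_quotes"],
}

response = chat(
    model="qwen2.5:7b-instruct",
    messages=[{"role": "user", "content": "Rubric...\nTask...\nStudent answer..."}],
    format=schema,                  # constrains decoding to this JSON Schema
    options={"temperature": 0},
)
result = json.loads(response.message.content)  # machine-checkable output
```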
Fine-tuning (SFT)
- Hugging Face TRL SFTTrainer docs: includes completion-only loss (compute loss on the completion only), which is exactly what you want for prompt→JSON grading. (Hugging Face)
- Known gotcha: completion-only loss can break with certain acceleration settings (for example, use_liger_kernel=True dropping the completion mask). This is the kind of thing you want to catch early with a tiny regression test. (GitHub)
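A sketch of such a regression test, assuming you have already constructed an SFTTrainer with completion-only loss enabled (see the sketch under Step 3 of the plan in section 5):

```python
# Pull one real batch from the trainer and verify the completion-only mask
# survived your exact training config (assumes `trainer` is an SFTTrainer
# built with completion_only_loss=True).
batch = next(iter(trainer.get_train_dataloader()))
labels = batch["labels"][0]

# Prompt tokens must be ignored by the loss (-100); if an acceleration flag
# silently drops the mask, every position becomes trainable.
assert (labels == -100).any(), "completion-only mask was dropped entirely!"
# ...and the completion itself must still be trainable.
assert (labels != -100).any(), "everything is masked; nothing to train on!"
```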
Tokenization / chat templates (train–inference parity)
- Transformers chat templating best practice: if you call apply_chat_template(tokenize=False) and tokenize later, you must set add_special_tokens=False to avoid duplicating special tokens. Using tokenize=True is often safer. (Hugging Face)
- There are also real-world reports of tokenize=True vs “format then encode” behaving differently in some cases, so you should pin a single approach and test it. (GitHub)
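A quick parity check you can pin in CI (the model name is only an example; the assertion is the point):

```python
# Verify the two tokenization paths agree, so training and inference see the
# exact same token stream.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "Grade this answer."}]

# Path A: template and tokenize in one step.
ids_a = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)

# Path B: format to text first, tokenize later. add_special_tokens=False is
# mandatory here because the template already inserted the special tokens.
text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
ids_b = tok(text, add_special_tokens=False)["input_ids"]

assert ids_a == ids_b, "train/inference tokenization has diverged"
```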
2) Which model is the best fit?
Given your constraints (MacBook Air M2, 16GB RAM, local inference, short JSON output, and only ~150 labeled examples), you want:
- 7B–8B instruct model
- Good instruction-following
- Strong “structured output / JSON” behavior
- Widely supported tooling (Transformers, GGUF/Ollama ecosystem, LoRA adapters)
My primary pick to start: Qwen2.5-7B-Instruct
Reasons:
- The model card explicitly calls out improvements in instruction following and structured outputs, especially JSON, plus long-context support (useful if you later add retrieved exemplars). (Hugging Face)
- Apache-2.0 license (simpler to deploy in many environments). (Hugging Face)
Strong alternatives to benchmark (you should test 2–3, not just 1)
Mistral-7B-Instruct-v0.3
- Apache-2.0 license and explicit support for function calling style usage, which often correlates with reliable structured responses. (Hugging Face)
Meta-Llama-3-8B-Instruct
- Very common baseline with broad community support, but note the custom commercial license (Llama 3 Community License) and its comparatively short published context length (8K). (Hugging Face)
What “best” means in your project (practical selection criteria)
Run a small benchmark on your frozen test split and pick the model that wins on:
- Agreement with human grades (use an ordinal metric like QWK for banded scoring)
- Schema validity rate (how often it emits valid JSON that passes your validator)
- Evidence-quote validity rate (quotes must be exact substrings)
- Stability (same answer across 2–3 low-temperature replications)
- Latency on your Mac (teacher workflow tolerance)
On QWK specifically: AES work commonly uses it, and the Kaggle ASAP-style setup is a standard reference point.
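Computing QWK is a one-liner with scikit-learn (the band arrays below are made-up placeholders):

```python
# Quadratic Weighted Kappa: agreement on ordinal bands, penalizing large
# disagreements more than near-misses.
from sklearn.metrics import cohen_kappa_score

human = [4, 5, 3, 6, 5, 4]   # teacher-assigned bands (frozen test split)
model = [4, 5, 4, 6, 5, 3]   # model-assigned bands

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.3f}")    # 1.0 = perfect agreement, 0 ≈ chance level
```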
3) Is it OK to train in the cloud and run locally on your Mac?
Yes. This is a normal pattern, and it is usually the best pattern for your hardware.
Why it works
- LoRA fine-tuning updates a small number of added parameters while freezing the base model, massively reducing what you need to store and train.
- QLoRA goes further: it quantizes the frozen base model (commonly to 4-bit) and backpropagates through it into the LoRA adapters, cutting fine-tuning memory dramatically. (arXiv)
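To see how small “small” is, here is a back-of-envelope count for rank-16 LoRA on the attention projections of a 7B-class model (dimensions are Qwen2.5-7B’s published config; GQA makes the real k/v numbers somewhat smaller):

```python
# Rough LoRA parameter count: each adapted (d_in x d_out) weight gains two
# low-rank factors A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out)
# extra trainable parameters.
hidden, layers, r = 3584, 28, 16        # Qwen2.5-7B-ish dimensions
per_proj = r * (hidden + hidden)        # square projection, approximated
trainable = layers * 4 * per_proj       # q, k, v, o per layer (k/v overcounted under GQA)
print(f"~{trainable / 1e6:.0f}M trainable vs ~7,600M frozen")  # ~13M, well under 0.5%
```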
How you get the trained result onto your Mac (Ollama path)
Ollama supports importing a fine-tuned Safetensors adapter with a Modelfile that references the base model plus your adapter directory. It also warns:
- Use the same base model as used to create the adapter, or results can be erratic.
- Because quantization methods vary, it is “best to use non-quantized (non-QLoRA) adapters” when importing adapters. (Ollama)
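The Modelfile itself is only two directives (the tag and adapter path below are placeholders):

```
# FROM must be the exact base model the adapter was trained against.
FROM qwen2.5:7b-instruct
ADAPTER ./qwen25-grader-lora
```

Then ollama create igcse-grader -f Modelfile registers the combined model locally.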
Local performance reality on a MacBook Air M2 16GB
- You will almost certainly run quantized weights for the base model to fit memory and get reasonable speed.
- Ollama explicitly frames quantization as trading some accuracy for much lower memory and better speed, enabling use on more modest hardware, and lists supported quantization levels. (Ollama)
The net of it: cloud training plus local quantized inference is the right pattern. Just be disciplined about base-model version matching, and test the exact on-device stack you will deploy.
4) How to find similar projects and examples
You’re right that “IGCSE grading assistant” is a niche search term. The trick is to search by the research/product category:
A. Use the right keywords (copy/paste searches)
Use combinations like:
- “automated essay scoring LLM rubric”
- “AES LLM quadratic weighted kappa”
- “rubric aligned scoring rationale generation”
- “prompted essay scoring Llama”
- “LoRA fine-tune essay scoring JSON”
These queries surface academic baselines plus GitHub repos.
B. Follow citation trails from relevant papers
The LLM-for-AES paper mentioned above cites and compares multiple approaches and mentions standard datasets like ASAP.
Once you find one good paper, scroll the related-work section and search those titles.
C. Look for “multi-trait / rationale” scoring (closest to your future roadmap)
Your longer-term plan (detailed content + SPaG feedback) maps well to “trait-based” scoring and rationale extraction, like the rubric-guideline + rationale approach.
D. Use “standard AES dataset” hubs as discovery engines
Even if you never train on them, the ASAP/Kaggle ecosystem is a directory of metrics, repo code, and evaluation conventions.
5) A concrete fine-tuning plan that fits your data size
If you want something you can execute without guessing:
Step 1: Baseline first (1–2 days)
- Prompt = rubric + task + student answer
- Output = strict JSON with band/score + 1–3 sentence comment + evidence quotes
- Enforce schema with structured outputs and validate server-side. (Ollama)
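A sketch of the server-side checks (jsonschema is one option; function and field names are assumptions, with schema being the same JSON Schema you pass to Ollama and answer the student’s raw text):

```python
# Server-side validation: JSON Schema validity plus exact-substring evidence
# quotes. Returns a list of problems; empty means the output passed.
from jsonschema import ValidationError, validate

def check_grading_output(result: dict, schema: dict, answer: str) -> list[str]:
    problems = []
    try:
        validate(result, schema)                  # same schema used at generation
    except ValidationError as err:
        problems.append(f"schema: {err.message}")
    for quote in result.get("evidence_quotes", []):
        if quote not in answer:                   # quotes must appear verbatim
            problems.append(f"quote not found verbatim: {quote!r}")
    return problems
```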
Step 2: Build your evaluation harness (same week)
- Frozen test split
- Metrics: QWK or ordinal agreement, boundary confusion counts, schema validity, quote validity
- Stability: run 2–3 low-temperature replications and flag disagreements
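A stability sketch (grade_one stands in for your Ollama call plus validation pipeline and is an assumption, not a real API):

```python
# Replicate each item a few times at low temperature and flag any item whose
# band changes across runs.
from collections import Counter

def stability_flags(items: list[dict], grade_one, n_reps: int = 3) -> list[tuple]:
    flagged = []
    for item in items:
        bands = [grade_one(item)["band"] for _ in range(n_reps)]
        if len(Counter(bands)) > 1:       # any disagreement across replications
            flagged.append((item["id"], bands))
    return flagged
```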
Step 3: Only then do LoRA SFT (when the baseline is “close but wobbly”)
- Use the TRL SFTTrainer prompt→completion format and enable completion-only loss. (Hugging Face)
- Add a regression test for the “completion-only loss mask” path to avoid known incompatibilities. (GitHub)
- Lock your chat template approach and keep it identical at train and inference time. (Hugging Face)
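A minimal sketch under those constraints (dataset path, hyperparameters, and output names are placeholders, and it assumes a recent TRL version where SFTConfig exposes completion_only_loss):

```python
# LoRA SFT on prompt→completion pairs with loss computed on the completion
# (the JSON) only. Values here are placeholders, not recommendations.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each JSONL row: {"prompt": "<rubric + task + answer>", "completion": "<strict JSON>"}
dataset = load_dataset("json", data_files="train.jsonl", split="train")

config = SFTConfig(
    output_dir="qwen25-grader-lora",
    completion_only_loss=True,        # mask the prompt out of the loss
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
trainer.save_model()                  # writes the LoRA adapter, not a merged model
```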
Step 4: Export the adapter and run locally
- Import the adapter in Ollama with FROM base + ADAPTER path, as shown in the Modelfile example in section 3.
- Keep base model identical to training base. (Ollama)
- Quantize the base for your Mac if needed, then run your smoke tests on-device. (Ollama)
Summary (key points)
- Start with structured JSON outputs + strict validation before training. (Ollama)
- For your hardware and dataset size, benchmark Qwen2.5-7B-Instruct first, plus 1–2 alternatives. (Hugging Face)
- Cloud fine-tuning + local quantized inference is the right workflow. Use LoRA adapters and keep base-model versions identical. (Ollama)
- Use AES research terms (“automated essay scoring”, “QWK”, “rubric”, “multi-trait”) to find similar projects, not “IGCSE” alone.