Automatic Item Generation (AIG) technology has emerged as a transformative approach to address fundamental challenges in psychological and educational test development, including high item creation costs, low development efficiency, difficulty in maintaining large item banks, and security vulnerabilities from item exposure in high-stakes testing contexts. This review systematically examines the technological evolution from rule-driven methods to large language model (LLM)-based approaches and analyzes contemporary implementation challenges with corresponding solutions.
The technological evolution of AIG encompasses three distinct phases. Early rule-based methods, including cognitive design systems, item modeling approaches, and ontology-based techniques, relied heavily on expert knowledge and predefined templates. While producing structurally sound items, these approaches required substantial professional input and lacked flexibility in handling complex linguistic phenomena. Corpus-based methods subsequently introduced statistical approaches that leveraged large-scale language data to enhance linguistic authenticity, though they remained constrained by corpus coverage and domain specificity.

Deep learning technologies marked a paradigm shift in AIG capabilities. Classical techniques such as word embeddings, recurrent neural networks (RNN), long short-term memory (LSTM), and sequence-to-sequence (Seq2Seq) models progressively improved semantic representation and text coherence. The introduction of the Transformer architecture, with its self-attention mechanism, revolutionized natural language processing by effectively capturing long-range dependencies. Building on this foundation, pre-trained large language models such as BERT, T5, and the GPT series learned rich language representations from massive text corpora, enabling sophisticated understanding and generation capabilities that fundamentally transformed AIG approaches. Contemporary LLM-based AIG systems employ domain fine-tuning and prompt engineering strategies, with knowledge enhancement technologies, particularly retrieval-augmented generation (RAG) and knowledge graphs, addressing professional knowledge accuracy by integrating external structured knowledge sources. LLM-based AIG demonstrates significant advantages over traditional methods, including dramatically improved generation efficiency, enhanced linguistic fluency and diversity, reduced dependence on extensive manual template creation, and the ability to generate contextually rich items across multiple domains and languages.
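To illustrate the prompt engineering strategy in concrete terms, the following minimal Python sketch fills an item specification into a prompt template and requests a draft item from an LLM. The OpenAI SDK, the model name, and the prompt wording are illustrative assumptions rather than a prescribed pipeline.

```python
# Minimal prompt-engineering sketch for LLM-based item generation.
# Assumes the OpenAI Python SDK; model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEM_PROMPT = """You are an experienced test developer.
Write one multiple-choice item measuring: {construct}
Target difficulty: {difficulty}
Requirements:
- one unambiguously correct option and three plausible distractors
- avoid cultural or regional references
Return the stem, options A-D, and the answer key."""

def generate_item(construct: str, difficulty: str) -> str:
    """Fill the prompt template and request a single draft item."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": ITEM_PROMPT.format(construct=construct,
                                                 difficulty=difficulty)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_item("proportional reasoning", "moderate"))
```

In practice, the template would encode a fuller item model (construct definition, format constraints, and exclusion rules) agreed upon by content experts before any large-scale generation.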
Despite technological advances, LLM-based AIG implementation faces seven core challenges across quality assurance, functional expansion, and practical application dimensions:
First, content authenticity and professional accuracy issues, particularly in specialized domains like medicine and law where LLMs exhibit “hallucination” phenomena. Solutions include employing larger-scale models (e.g., GPT-4 over GPT-3.5), implementing human-AI collaborative frameworks where experts guide initial generation and validate outputs, deploying RAG systems with domain-specific knowledge bases and sophisticated retrieval mechanisms, and establishing comprehensive expert review protocols covering content validity, professional accuracy, conceptual depth, and structural integrity.
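A simplified sketch of the RAG strategy is given below: the item specification retrieves supporting passages from a domain knowledge base, and the generation prompt is explicitly restricted to that evidence. TF-IDF similarity stands in here for a production embedding index, and the clinical snippets are illustrative placeholders.

```python
# Simplified retrieval-augmented generation (RAG) sketch for a domain knowledge base.
# TF-IDF similarity is a stand-in for a production embedding index; the knowledge
# snippets and prompt wording are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

KNOWLEDGE_BASE = [
    "Warfarin dosing is monitored with the INR; the usual therapeutic range is 2.0-3.0.",
    "Beta-blockers reduce myocardial oxygen demand by lowering heart rate and contractility.",
    "Metformin is contraindicated in severe renal impairment because of lactic acidosis risk.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge snippets most similar to the item specification."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(KNOWLEDGE_BASE + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [KNOWLEDGE_BASE[i] for i in top]

def build_prompt(specification: str) -> str:
    """Ground the generation prompt in retrieved domain evidence to curb hallucination."""
    evidence = "\n".join(f"- {s}" for s in retrieve(specification))
    return (f"Using ONLY the evidence below, write one clinical multiple-choice item.\n"
            f"Evidence:\n{evidence}\n"
            f"Item specification: {specification}\n"
            f"Flag any required fact missing from the evidence instead of inventing it.")

print(build_prompt("anticoagulation monitoring in atrial fibrillation"))
```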
Second, ethical responsibility, cultural fairness, and construct validity concerns, especially in personality and social attitude assessments where models may perpetuate training data biases. Strategies encompass data augmentation to balance cultural representation, multi-stage psychometric validation progressing from expert screening through pilot testing to large-scale validation, Differential Item Functioning (DIF) analysis for cross-group measurement equivalence, and adoption of Exploratory Structural Equation Modeling (ESEM) over traditional confirmatory factor analysis for AI-generated items.
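As an example of the DIF analysis step, the sketch below screens a single item with the common logistic regression approach, comparing a model containing only the matching score against one that adds group and group-by-score terms. The data are simulated, and the procedure is a minimal illustration rather than a full psychometric workflow.

```python
# Minimal logistic-regression DIF screening sketch on simulated responses.
# Sample size, group effect, and the DIF magnitude are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 1000
ability = rng.normal(size=n)
group = rng.integers(0, 2, size=n)              # 0 = reference group, 1 = focal group
logit = 1.2 * ability - 0.5 - 0.6 * group       # item simulated with uniform DIF
resp = rng.binomial(1, 1 / (1 + np.exp(-logit)))
data = pd.DataFrame({"resp": resp, "score": ability, "group": group})

# Nested models: matching variable only vs. adding group and group-by-score terms.
base = smf.logit("resp ~ score", data).fit(disp=False)
full = smf.logit("resp ~ score + group + score:group", data).fit(disp=False)

# Likelihood-ratio test with 2 df; a significant result flags the item for review.
lr_stat = 2 * (full.llf - base.llf)
print(f"LR = {lr_stat:.2f}, p = {chi2.sf(lr_stat, df=2):.4f}")
```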
Third, single-modality limitations restricting assessments to text-based items, excluding visual-spatial reasoning and multimodal content. Multimodal large language models (MLLMs) such as GPT-4o and Phi-3-vision offer solutions, though they require specialized prompt engineering frameworks that translate assessment requirements into multimodal representations while maintaining psychometric properties and disciplinary standards.
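A hedged sketch of such a multimodal prompt is shown below, assuming the OpenAI Python SDK and GPT-4o; the figure URL is a placeholder for an assessment stimulus, and the instruction wording would in practice be derived from the assessment framework.

```python
# Sketch of a multimodal item-generation prompt with an image stimulus.
# Assumes the OpenAI Python SDK; the figure URL and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()
FIGURE_URL = "https://example.org/stimulus_figure.png"  # placeholder stimulus image

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Write one multiple-choice item assessing visual-spatial reasoning "
                      "based on the attached figure. The stem must require interpreting "
                      "the figure, not recalling facts. Provide options A-D and the key.")},
            {"type": "image_url", "image_url": {"url": FIGURE_URL}},
        ],
    }],
)
print(response.choices[0].message.content)
```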
Fourth, insufficient capability for generating open-ended items assessing higher-order thinking skills. Solutions involve integrating Bloom’s taxonomy or similar frameworks into generation prompts, developing synchronized scoring rubrics and exemplar responses, and creating evaluation criteria encompassing cognitive complexity, response diversity, and scoring feasibility.
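The sketch below illustrates one way a Bloom level and a synchronized rubric request might be embedded in a generation prompt; the level descriptors, verbs, and rubric dimensions are assumptions for demonstration only.

```python
# Illustrative prompt template that pins an open-ended item to a Bloom's taxonomy level
# and requests a synchronized scoring rubric and exemplar response. Level descriptors
# and rubric dimensions are assumptions for demonstration.
BLOOM_VERBS = {
    "analyze": "differentiate, organize, attribute",
    "evaluate": "critique, judge, justify",
    "create": "design, construct, formulate",
}

OPEN_ENDED_PROMPT = """Generate one open-ended item for the topic: {topic}
Target cognitive level (Bloom): {level} (expected verbs: {verbs})
Also produce, in the same response:
1. An analytic scoring rubric with 3 performance levels covering
   cognitive complexity, use of evidence, and coherence of argument.
2. One exemplar full-credit response.
Do not produce an item answerable by simple recall."""

def build_open_ended_prompt(topic: str, level: str) -> str:
    """Assemble the generation prompt for a given topic and Bloom level."""
    return OPEN_ENDED_PROMPT.format(topic=topic, level=level,
                                    verbs=BLOOM_VERBS[level])

print(build_open_ended_prompt("causes of urban heat islands", "evaluate"))
```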
Fifth, quality control's reliance on manual review, which contradicts efficiency goals. Emerging intelligent evaluation systems demonstrate promise, though they require training that is independent of the generation models, domain-specific evaluation models, and integration of final human expert review within hybrid quality assurance systems.
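One possible shape of such a hybrid pipeline is sketched below: an evaluator model, chosen independently of the generator, scores each draft against explicit criteria, and only items meeting a threshold bypass immediate human review. The criteria, model names, threshold, and JSON schema are illustrative assumptions.

```python
# Sketch of a hybrid quality-assurance step: an evaluator model, separate from the
# generator, scores each draft item and routes low-scoring items to human review.
# The criteria, evaluator model, JSON keys, and threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
CRITERIA = ["content accuracy", "clarity of the stem", "plausibility of distractors"]

def evaluate_item(item_text: str) -> dict:
    """Ask an evaluator model (distinct from the generator) for criterion scores 1-5."""
    prompt = (f"Score the following test item from 1 (poor) to 5 (excellent) on: "
              f"{', '.join(CRITERIA)}. Respond as a JSON object with those keys.\n\n"
              f"{item_text}")
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # evaluator chosen independently of the generator
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def route(item_text: str, threshold: int = 4) -> str:
    """Accept automatically only when every criterion meets the threshold."""
    scores = evaluate_item(item_text)
    return "auto-accept" if all(v >= threshold for v in scores.values()) else "human review"
```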
Sixth, resource constraints limiting access to computational infrastructure. Parameter-efficient fine-tuning techniques (LoRA, QLoRA) reduce memory requirements while maintaining performance. Cloud computing services provide on-demand resources, and prompt engineering optimization offers low-resource alternatives for institutions unable to fine-tune models.
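As an illustration of the parameter-efficient route, the sketch below attaches LoRA adapters to a causal language model with the Hugging Face PEFT library; the base checkpoint, target modules, and hyperparameters are illustrative defaults rather than recommendations.

```python
# Minimal LoRA fine-tuning configuration sketch using Hugging Face PEFT.
# The base checkpoint, target modules, and hyperparameters are illustrative defaults.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-3.1-8B"   # any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension: small adapters, small memory cost
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```

A QLoRA variant would additionally load the base model in 4-bit precision (for example, via bitsandbytes quantization) before attaching the adapters, further reducing memory requirements.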
Seventh, technical complexity creating barriers for non-technical domain experts. User-friendly interfaces should abstract API interactions, provide template libraries for common assessment types, offer real-time preview and editing capabilities, and integrate quality feedback mechanisms guiding users toward best practices.
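A minimal sketch of the template-library idea is given below: the interface exposes named assessment types and form fields, while prompt construction and API details stay hidden from the user. The template names, fields, and wording are illustrative.

```python
# Sketch of a small item-template library that a non-technical interface could expose,
# hiding prompt construction and API details behind named assessment types.
# Template names, fields, and wording are illustrative.
from dataclasses import dataclass

@dataclass
class ItemTemplate:
    name: str
    prompt: str          # prompt skeleton with named placeholders
    fields: tuple        # placeholders the interface asks the user to fill

TEMPLATE_LIBRARY = {
    "mcq_knowledge": ItemTemplate(
        name="Multiple-choice knowledge item",
        prompt="Write a multiple-choice item on {topic} for {audience}, with 4 options and a key.",
        fields=("topic", "audience"),
    ),
    "likert_attitude": ItemTemplate(
        name="Likert attitude statement",
        prompt="Write a 5-point Likert statement measuring {construct}, avoiding double-barreled wording.",
        fields=("construct",),
    ),
}

def build_request(template_key: str, **user_inputs: str) -> str:
    """Turn form inputs from the interface into a ready-to-send generation prompt."""
    template = TEMPLATE_LIBRARY[template_key]
    return template.prompt.format(**user_inputs)

print(build_request("mcq_knowledge", topic="photosynthesis", audience="grade 9 students"))
```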
This review reveals that successful LLM-based AIG implementation requires coordinated solutions addressing domain knowledge, technological, psychometric, and practical challenges. Key priorities include optimizing knowledge enhancement technologies for domain-specific accuracy, adapting validation methods for AI-generated content, and establishing human-AI collaborative workflows that leverage computational efficiency while maintaining professional standards. The convergence of these approaches—systematic RAG implementation, multi-stage psychometric validation, parameter-efficient training methods, and accessible user interfaces—provides a practical framework for advancing test development. Future directions should focus on empirical validation of these strategies across diverse assessment contexts and the establishment of standardized protocols that ensure both innovation and measurement quality in the evolving landscape of automated item generation.