Improving FAQ Retrieval for Academic Regulations Using Semantic Embeddings and LLM Question Augmentation
DOI:
https://doi.org/10.55583/jtisi.v4i1.2176
Keywords:
semantic retrieval, FAQ retrieval, IndoSBERT, question augmentation, academic regulations
Abstract
Academic regulations in higher education are often documented in lengthy, formal handbooks, which makes it difficult for students to find relevant information when they phrase questions in everyday language. This study developed a semantic FAQ retrieval system for academic regulations using IndoSBERT and question augmentation. The FAQ corpus was constructed from official academic and internship documents, resulting in 92 FAQ entries across 33 topical categories. Seed questions were generated from category–keyword pairs and expanded using simple rule-based augmentation and FLAN-T5-based paraphrasing. The resulting question dataset was divided using an 80:10:10 train–validation–test split. IndoSBERT was fine-tuned with Multiple Negatives Ranking Loss under three configurations: baseline, baseline with simple augmentation, and baseline with simple plus LLM-based augmentation. Retrieval performance was measured using Recall@1, Recall@3, Recall@5, and Mean Reciprocal Rank (MRR). The best result was achieved by the simple plus LLM augmentation configuration, with Recall@1 of 0.7848, Recall@5 of 0.8987, and MRR of 0.8396. These findings indicate that LLM-based question augmentation improves the robustness of semantic retrieval while keeping answers grounded in curated academic regulations.
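For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Recall@k and MRR are commonly computed for FAQ retrieval. It assumes that each test question has exactly one correct FAQ entry and that the retriever returns a ranked list of FAQ identifiers per question. The abstract does not state the exact evaluation procedure, so the function names, identifiers, and toy data here are hypothetical and purely illustrative.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                k: int) -> float:
    """Fraction of queries whose gold FAQ appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)


def mean_reciprocal_rank(ranked_ids: Sequence[Sequence[str]],
                         gold_ids: Sequence[str]) -> float:
    """Average of 1/rank of the gold FAQ; a query contributes 0 if it is missing."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (list(ranked).index(gold) + 1)
    return total / len(gold_ids)


# Toy data (hypothetical IDs, not taken from the study):
# one ranked list of FAQ IDs per test question, best match first.
ranked = [
    ["faq_3", "faq_1", "faq_7"],  # gold at rank 1
    ["faq_2", "faq_5", "faq_9"],  # gold at rank 2
    ["faq_4", "faq_8", "faq_6"],  # gold not retrieved
]
gold = ["faq_3", "faq_5", "faq_0"]

r1 = recall_at_k(ranked, gold, k=1)
r3 = recall_at_k(ranked, gold, k=3)
mrr = mean_reciprocal_rank(ranked, gold)
print(f"Recall@1={r1:.4f} Recall@3={r3:.4f} MRR={mrr:.4f}")
```

Under this single-gold-answer assumption, Recall@1 is never larger than MRR, and MRR is never larger than Recall@k when the ranked lists are cut off at k, which is consistent with the ordering of the reported values (0.7848, 0.8396, 0.8987).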

