As Large Language Models (LLMs) demonstrate increasing potential in complex legal tasks such as argument generation, it is crucial to evaluate their reliability, especially their factual accuracy and adherence to instructions. To address this, we introduce an automated pipeline for evaluating how well LLMs generate 3-ply case-based legal arguments. The pipeline measures three things: faithfulness (absence of hallucination), factor utilization (completeness), and appropriate abstention. We encourage you to read the paper for more details and explore our findings. If you have questions, please reach out to us!
Figure 1: 3-ply Argument Scheme
Our automated method employs an external LLM to extract factors from arguments generated by models under test. These extracted factors are then compared against ground-truth factors from the input case materials. This allows us to quantify faithfulness (hallucination accuracy), factor utilization recall, and abstention ratio.
Figure 2: Evaluation Pipeline Flowchart
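The comparison step described above can be sketched with plain set operations. This is a minimal illustration rather than the pipeline's actual implementation: the names `ArgumentResult`, `hallucination_accuracy`, `utilization_recall`, and `abstention_ratio`, the factor labels such as `F1`, and the exact formulas are all assumptions made for this example, and the paper's precise definitions may differ. The factor-extraction step itself (performed by an external LLM) is not shown; the sketch starts from factor sets that have already been extracted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ArgumentResult:
    """One generated argument, reduced to factor sets (hypothetical structure)."""
    ground_truth: frozenset[str]  # factors present in the input case materials
    extracted: frozenset[str]     # factors extracted from the generated argument
    abstained: bool = False       # whether the model declined to generate an argument


def hallucination_accuracy(r: ArgumentResult) -> float:
    """Assumed definition: share of extracted factors that exist in the ground truth."""
    if not r.extracted:
        return 1.0  # nothing was cited, so nothing was hallucinated
    return len(r.extracted & r.ground_truth) / len(r.extracted)


def utilization_recall(r: ArgumentResult) -> float:
    """Assumed definition: share of ground-truth factors the argument actually used."""
    if not r.ground_truth:
        return 1.0  # no factors were available to use
    return len(r.extracted & r.ground_truth) / len(r.ground_truth)


def abstention_ratio(results: list[ArgumentResult]) -> float:
    """Assumed definition: share of runs in which the model abstained."""
    if not results:
        return 0.0
    return sum(r.abstained for r in results) / len(results)


if __name__ == "__main__":
    # Hypothetical factor labels; "F9" does not appear in the input cases.
    run = ArgumentResult(
        ground_truth=frozenset({"F1", "F2", "F3", "F4"}),
        extracted=frozenset({"F1", "F2", "F9"}),
    )
    print(hallucination_accuracy(run))  # 2 of 3 cited factors are grounded
    print(utilization_recall(run))      # 2 of 4 available factors were used
```

Treating factors as sets keeps the metrics independent of how often a factor is repeated in an argument. Whether repeated mentions should count more than once is a design choice this sketch does not address.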
We evaluated eight distinct LLMs on three tests of increasing difficulty:
Our automated pipeline provides a scalable method for assessing crucial LLM behaviors in legal argument generation. Some highlights of our findings:
Overall, LLMs need significant improvement in comprehensive reasoning and, most crucially, in robust instruction following (particularly with negative constraints and appropriate abstention) before they can be reliably deployed for substantive legal argumentation tasks.
We thank the Intelligent Systems Program at the University of Pittsburgh and the School of Computer Science at Carnegie Mellon University.
@misc{zhang2025measuring,
  title={Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments},
  author={Li Zhang and Morgan Gray and Jaromir Savelka and Kevin D. Ashley},
  year={2025},
  eprint={2506.00694},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}