Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

University of Pittsburgh · Carnegie Mellon University
7th Workshop on Automated Semantic Analysis of Information in Legal Text at ICAIL 2025

Overview

As Large Language Models (LLMs) demonstrate increasing potential in complex legal tasks like argument generation, evaluating their reliability, especially regarding factual accuracy and adherence to instructions, is crucial. To address this, we introduce an automated pipeline for evaluating LLM performance on generating 3-ply case-based legal arguments. This pipeline specifically focuses on measuring faithfulness (absence of hallucination), factor utilization (completeness), and appropriate abstention. We encourage you to read the paper for more details and explore our findings. If you have questions, please reach out to us!

Figure 1: 3-ply Argument Scheme

Automated Evaluation Pipeline

Our automated method employs an external LLM to extract factors from arguments generated by models under test. These extracted factors are then compared against ground-truth factors from the input case materials. This allows us to quantify faithfulness (hallucination accuracy), factor utilization recall, and abstention ratio.
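To make these metrics concrete, here is a minimal sketch of how they could be computed once factor mentions have been extracted and normalized to factor IDs. The function names, the handling of empty sets, and the per-case abstention flags are illustrative assumptions, not the pipeline's exact implementation.

    # Minimal sketch (assumed, not the paper's exact implementation):
    # the three metrics computed from sets of normalized factor IDs.

    def hallucination_accuracy(extracted, ground_truth):
        # Fraction of factors cited in the generated argument that actually
        # appear in the input case materials (1.0 = no hallucinated factors).
        if not extracted:
            return 1.0
        return len(extracted & ground_truth) / len(extracted)

    def factor_utilization_recall(extracted, ground_truth):
        # Fraction of ground-truth factors that the argument actually used.
        if not ground_truth:
            return 1.0
        return len(extracted & ground_truth) / len(ground_truth)

    def abstention_ratio(abstained_flags):
        # Fraction of impossible-argument cases on which the model abstained.
        return sum(abstained_flags) / len(abstained_flags) if abstained_flags else 0.0

    # Hypothetical factor IDs for illustration:
    extracted = {"F1", "F6", "F15"}     # factors the external LLM found in the argument
    ground_truth = {"F1", "F6", "F10"}  # factors actually present in the case materials
    print(hallucination_accuracy(extracted, ground_truth))     # ~0.67 (F15 is hallucinated)
    print(factor_utilization_recall(extracted, ground_truth))  # ~0.67 (F10 goes unused)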

Figure 2: Evaluation Pipeline Flowchart

We evaluated eight distinct LLMs on three tests of increasing difficulty:

  • Test 1: Standard 3-ply argument generation. This test assesses baseline performance when arguments are factually supported.
  • Test 2: Argument generation with swapped precedent roles. This test probes the LLM's robustness in selecting the appropriate precedent when the typical roles are reversed.
  • Test 3: Recognizing impossibility of argument generation and abstaining. This test directly probes the LLM's ability to identify that the task is impossible and to adhere to negative constraints (see the sketch after this list).
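For intuition about Test 3, the abstention condition hinges on shared factors: if the current fact situation and the candidate precedent share no factor favoring the arguing side, no honest 3-ply argument can be constructed and the model should decline. The check below is a simplified assumption about that condition, using hypothetical factor IDs, rather than the paper's exact rule.

    # Simplified viability check (an assumption, not the paper's exact rule):
    # a plaintiff-side 3-ply argument is treated as impossible when the current
    # fact situation (CFS) and the precedent share no pro-plaintiff factors.

    def argument_is_viable(cfs_factors, precedent_factors, pro_plaintiff_factors):
        # True if at least one factor favoring the plaintiff appears in both cases.
        shared = cfs_factors & precedent_factors
        return bool(shared & pro_plaintiff_factors)

    # Hypothetical example: no shared pro-plaintiff factor, so a faithful model
    # should abstain instead of generating a spurious argument.
    cfs = {"F2", "F10"}
    precedent = {"F7", "F16"}
    pro_plaintiff = {"F6", "F15", "F16"}
    print(argument_is_viable(cfs, precedent, pro_plaintiff))  # False -> abstain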

Takeaways

Our automated pipeline provides a scalable method for assessing crucial LLM behaviors in legal argument generation. We encourage you to read the paper for more details, but here are some highlights of our findings:

  • High Hallucination Accuracy in Viable Scenarios: Current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), meaning they generally avoid citing non-existent factors.
  • Incomplete Factor Utilization: While accurate, LLMs often fail to use the full set of relevant factors present in the cases, producing less comprehensive arguments. Factor utilization recall varied widely, from roughly 40% to roughly 85% across Tests 1 & 2.
  • Significant Failure in Abstention: Critically, on the abstention test (Test 3), most models failed to follow the instruction to abstain, instead generating spurious arguments despite the lack of common factors. Only a few models showed a meaningful ability to abstain, with GPT-4o performing best (86.67% abstention ratio).
  • Automated Metrics are Effective: The proposed automated metrics (Hallucination Accuracy, Factor Utilization Recall, and Abstention Ratio) successfully quantified distinct error types and revealed performance variations across different LLMs and task complexities.

Overall, significant improvements are needed in ensuring comprehensive reasoning and, most crucially, in robust instruction following, particularly regarding negative constraints and the ability to abstain appropriately, before LLMs can be reliably deployed for substantive legal argumentation tasks.

Acknowledgments

We thank the Intelligent Systems Program at the University of Pittsburgh and the School of Computer Science at Carnegie Mellon University.

BibTeX

@misc{zhang2025measuring,
        title={Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments},
        author={Li Zhang and Morgan Gray and Jaromir Savelka and Kevin D. Ashley},
        year={2025},
        eprint={2506.00694},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
}