Biomedical relation extraction (BioRE) converts unstructured biomedical literature into structured knowledge, but traditional supervised approaches depend on costly annotated datasets, which limits how well they scale across relation types and domains. An arXiv preprint (ID: 2606.15412) by Jakob Mraz, Tomaž Curk, and Blaž Zupan investigates whether large language models (LLMs) can serve as a viable alternative through few-shot prompt-based learning.
Task Formulations and Experimental Design
The study compares two task formulations for few-shot BioRE: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments were conducted on the BioREDirect dataset. The authors report a clear precision-recall trade-off between the two approaches.
| Formulation | Precision | Recall | Computational efficiency |
|---|---|---|---|
| Pairwise classification | Lower | Higher | Lower |
| Joint generation | Higher | Lower | Higher |
Joint generation is more computationally efficient because it extracts multiple relations in a single model call, but it sacrifices recall. Pairwise classification, which queries each entity pair separately, captures more relations at the cost of precision.
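The difference in call count between the two formulations can be illustrated with a minimal sketch. The prompt wording, the function names `pairwise_prompts` and `joint_prompt`, and the example sentence and entities are all hypothetical and are not taken from the preprint; the sketch only shows how the number of prompts scales, not how the authors built theirs.

```python
from itertools import combinations


def pairwise_prompts(text, entities):
    """Build one prompt per unordered entity pair (pairwise classification)."""
    return [
        f"Text: {text}\nWhat is the relation between '{a}' and '{b}'?"
        for a, b in combinations(entities, 2)
    ]


def joint_prompt(text, entities):
    """Build a single prompt asking for all relations at once (joint generation)."""
    listed = ", ".join(entities)
    return (
        f"Text: {text}\nEntities: {listed}\n"
        "List every relation that holds between these entities."
    )


# Hypothetical input, used only to count prompts.
text = "Example abstract mentioning several genes, diseases and chemicals."
entities = ["gene A", "gene B", "disease C", "chemical D"]

pairwise = pairwise_prompts(text, entities)
joint = joint_prompt(text, entities)

# n entities give n * (n - 1) / 2 pairwise prompts, versus one joint prompt.
print(len(pairwise), "pairwise calls vs. 1 joint call")
```

With n entities, the pairwise formulation issues n(n-1)/2 prompts for unordered pairs, so its cost grows quadratically with the number of entities in a document, whereas the joint formulation stays at one call.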
Key Performance Metrics
The best-performing model achieved a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) but remaining below the supervised baseline (0.56). Notably, much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across imbalanced relation types, prompt-based approaches outperformed the supervised baseline (0.45 vs. 0.38), particularly on rare relation types.
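To see why micro-F1 and macro-F1 can rank systems differently, the following sketch computes both from per-type counts of true positives, false positives, and false negatives. The counts and relation-type names are invented for illustration and are not results from the study; they only show how one frequent, poorly predicted type can pull micro-F1 down while macro-F1, which weights every type equally, stays higher.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts):
    """Pool counts over all relation types, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per relation type, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts per relation type, NOT data from the paper.
counts = {
    "frequent_type": (40, 30, 30),  # many instances, predicted poorly
    "rare_type_a": (4, 1, 1),       # few instances, predicted well
    "rare_type_b": (3, 1, 1),
}

print(f"micro-F1: {micro_f1(counts):.3f}")
print(f"macro-F1: {macro_f1(counts):.3f}")
```

Because micro-F1 pools counts, the frequent type dominates it; macro-F1 gives the two rare types the same weight as the frequent one, which is why it better reflects performance across imbalanced relation types.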
Implications for Low-Resource Applications
These findings underscore the potential of LLMs for BioRE in low-resource settings where annotated data is scarce. The higher macro-F1, driven by rare relation types, suggests that LLMs may generalize better to infrequent relations, which are a common challenge in biomedical domains. However, the study emphasizes that relation schemas must be well defined, since ambiguous definitions degrade performance.
Limitations and Future Directions
While prompt-based learning shows promise, the micro-F1 gap (0.44 vs. 0.56) indicates that supervised learning still leads on overall performance when sufficient annotated data is available. Because much of that gap is attributed to a single ambiguously defined relation type, future work may focus on refining relation definitions or on combining few-shot LLM approaches with small amounts of supervised data to close the remaining difference.