Topic
nlp
LLM-Encoded Knowledge Guides Federated Graph Recommendation to Improve Accuracy
Researchers propose a federated graph recommendation framework that leverages LLM-encoded semantic knowledge to guide cross-client structural aggregation, addressing the challenge of non-IID client data. The method consistently outperforms existing federated graph baselines on standard benchmarks.
AdaMame: New Training Recipe Solves Language Collapse in Multilingual Reasoning Models
AdaMame, a two-stage training recipe for multilingual mathematical reasoning, addresses language collapse in large reasoning models. It adaptively aligns reasoning language to the query language without compromising accuracy, achieving Pareto-optimal performance across 12 languages.
MMLongEmbed Benchmark Reveals Limitations in Long-Context Multimodal Embedding Models
MMLongEmbed is the first comprehensive benchmark for evaluating multimodal embedding models (MEMs) in long-context scenarios. It comprises four retrieval tasks covering text, document, and video modalities. The evaluation reveals that current MEMs rely heavily on superficial feature matching and struggle with deep semantic and structural dependencies, with performance degrading systematically based on context length and key information placement.
New Self-Enhanced Fine-Tuning Method Boosts Text-to-SQL Reasoning and Generalization
Researchers propose CoTE-SQL, a self-enhanced fine-tuning method that improves text-to-SQL generation by integrating reasoning traces, structured chain-of-thought prompting, and execution error correction. The approach achieves state-of-the-art results on Bird and Spider benchmarks, particularly on complex queries.
New Method Resolves Drift Attribution Ambiguity in LLM Evaluation Pipelines
A research paper introduces an anytime-valid attribution method for LLM evaluation pipelines that resolves the ambiguity between product drift and judge model changes. Using a fixed human-labeled anchor set and betting e-processes, the method achieved zero misattribution on silent version bumps and correctly attributed prompt changes in 110 of 120 runs, while the industry-default rolling z-test raised false alarms on 75% of drift-free streams.
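For readers unfamiliar with betting e-processes, the following is a minimal generic sketch, not the paper's construction. It assumes each observation is a score in [0, 1] (for example, whether the judge agrees with a human label on an anchor item) and tests the null hypothesis that the mean is at least `null_mean`. The function name, the fixed `bet` size, and the example numbers are all illustrative assumptions.

```python
def betting_e_process(observations, null_mean, bet, alpha=0.05):
    """Return the first 1-based step at which wealth reaches 1/alpha, else None.

    Generic fixed-bet wealth process (illustrative, not the paper's method).
    Observations must lie in [0, 1]. The bet must satisfy
    -1/(1 - null_mean) < bet < 1/null_mean so wealth never goes negative.
    A negative bet gains wealth when observations fall below null_mean.
    """
    assert 0.0 < null_mean < 1.0
    assert -1.0 / (1.0 - null_mean) < bet < 1.0 / null_mean
    wealth = 1.0
    for step, x in enumerate(observations, start=1):
        # Under the null, each factor has expectation at most 1 when bet <= 0,
        # so wealth is a non-negative supermartingale.
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / alpha:
            return step  # valid to stop at any time (Ville's inequality)
    return None


# Hypothetical streams: agreement collapses vs. agreement stays perfect.
stopped_on_drift = betting_e_process([0, 0, 0, 0], null_mean=0.9, bet=-2.0)
stopped_on_clean = betting_e_process([1] * 50, null_mean=0.9, bet=-2.0)
```

Because the rejection threshold holds at every step simultaneously, the stream can be monitored continuously without the repeated-testing inflation that affects a rolling z-test.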
EHRNote-ChatQA: New Benchmark Tests LLMs on Multi-Turn Clinical Question Answering
Researchers introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries. Built from MIMIC-IV data, it contains 967 patient-level samples and 16,072 QA pairs. The evaluation reveals that LLMs struggle more with grounding answers in evidence than with producing the answer content, and that errors compound across turns.
Koshur Diacritizer: A Byte-Level Model Restores Diacritics for Kashmiri Language NLP
Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small, to restore missing diacritic marks in Kashmiri digital text. The model, trained on 23,700 sentence pairs, achieves a DERm of 0.2012 and a word error rate of 0.2159, with an accuracy of 77.5% in native-expert evaluation. The dataset, model, and source code are publicly released to support low-resource language research.
Researchers Tackle Annotator Disagreement to Improve Hate Speech Classification Accuracy
A new research paper from Dehghan, Sen, and Yanikoglu explores the challenge of annotator disagreement in hate speech classification. The authors evaluate aggregation methods like majority voting and ordinal strategies, demonstrating that filtering non-consensus samples leads to over-optimistic results and that leveraging perceived hate speech strength enhances performance. They establish new state-of-the-art results for Turkish tweets.
Data Augmentations Offer Path to Efficient Language Model Pretraining Under Data Constraints
As AI labs face a data ceiling where compute capacity outpaces new high-quality text, researchers propose data augmentations to enable productive multi-epoch training on fixed corpora. Three categories—token-level noise, sequence permutations, and target offset prediction—are shown to delay overfitting and lower validation loss compared to standard autoregressive pretraining. Random token replacement achieved the best minimum loss among individual methods, with combined augmentations further improving results.
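As a rough illustration of the token-level noise category, random token replacement can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the `rate` parameter, and uniform sampling over the vocabulary are illustrative choices, and the summary does not say whether the noise is applied to inputs only or also to prediction targets.

```python
import random


def random_token_replacement(tokens, vocab_size, rate, rng):
    """Replace each token id with a uniformly random id with probability `rate`.

    Hypothetical helper for illustration; the exact noise schedule used in
    the study is not described in the summary above.
    """
    return [
        rng.randrange(vocab_size) if rng.random() < rate else token
        for token in tokens
    ]


rng = random.Random(0)           # seeded so each epoch's noise is reproducible
clean = [5, 17, 42, 8, 99, 3]    # made-up token ids
noisy = random_token_replacement(clean, vocab_size=100, rate=0.15, rng=rng)
```

Drawing fresh noise each epoch means the model sees a slightly different version of the same sequence on every pass, which is one plausible reason such augmentations delay overfitting on a fixed corpus.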
Few-Shot Biomedical Relation Extraction with LLMs: A Viable Alternative to Supervised Learning?
A new study on arXiv investigates few-shot biomedical relation extraction using large language models (LLMs). The best model achieved a micro-F1 of 0.44, surpassing prior few-shot results but falling short of the supervised baseline. On macro-F1, however, prompt-based methods outperformed supervised learning, particularly on rare relation types, highlighting LLMs' potential in low-resource settings.
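The gap between the two metrics is easier to see with a toy example. The sketch below uses made-up labels (not data from the study) and a simplified single-label setup with no special "no relation" class. Micro-F1 pools counts over all examples, so frequent classes dominate; macro-F1 averages per-class F1, so rare classes count equally.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Illustrative labels: 8 examples of a common relation, 2 of a rare one.
gold = ["common"] * 8 + ["rare"] * 2
always_common = ["common"] * 10                      # ignores the rare class
covers_rare = ["common"] * 6 + ["rare"] * 2 + ["rare"] * 2

micro_a, macro_a = f1_scores(gold, always_common)
micro_b, macro_b = f1_scores(gold, covers_rare)
```

Both systems make two mistakes and therefore share the same micro-F1, but the one that recovers the rare class scores much higher on macro-F1, which is the pattern the study reports for prompt-based methods.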