LLMs Struggle on Privacy-Constrained Industrial Tabular Data, Study Finds

A new study posted on arXiv compares large language models (LLMs) with classical machine learning on an industrial car retrofit prediction task and finds that, while LLMs have niche uses, tree ensembles remain the strongest standalone models. The authors conclude that on privacy-constrained tables, LLMs work better as complementary components than as replacements.

iGEN Editorial

June 16, 2026

Industrial retrofit planning relies on structured operational data, not free text. Planners must estimate whether a newly registered prototype will require a retrofit, which package it will need, and how long the work will take. A study on arXiv, titled "LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction," examines this challenge using a real-world industrial dataset. The researchers compared strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs.

The dataset links a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits). The tasks include binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark.

LLM Strategies Compared

The study evaluated three LLM approaches:

Embedding features using Amazon Titan
Direct prompted classification using Claude Sonnet 4
ML+LLM stacking (a hybrid approach)

These were pitted against classical tree ensembles and other tabular methods.
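To make "row-serialized inputs" and the effect of hashing concrete, here is a minimal sketch. It is not the paper's pipeline: the column names, the "column is value" template, the salt, and the choice to hash both column names and values are all assumptions made for illustration.

```python
import hashlib

def hash_value(value, salt="demo-salt", length=8):
    """Replace a raw value with a short, salted SHA-256 digest (illustrative scheme)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:length]

def serialize_row(row, hashed=False):
    """Turn one table row (a dict) into a 'column is value' text string."""
    parts = []
    for column, value in row.items():
        if hashed:
            # Assumption: both the column name and the value are hashed,
            # so no human-readable token survives in the text.
            column, value = hash_value(column), hash_value(value)
        parts.append(f"{column} is {value}")
    return "; ".join(parts)

# Hypothetical row; these column names are not from the study.
row = {"model_line": "sedan", "engine": "hybrid", "plant": "A"}
plain_text = serialize_row(row)
hashed_text = serialize_row(row, hashed=True)
print(plain_text)
print(hashed_text)
```

The plain string gives a language model words it can reason about, while the hashed string keeps only the fact that equal inputs map to equal tokens. That difference is one plausible reading of why prompting and embeddings behave so differently in the results that follow.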

Key Findings

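The results show a clear pattern: classical tree ensembles remain the strongest standalone models. The LLM strategies, however, behave consistently across tasks: embeddings still carry usable signal, direct prompting collapses once hashing removes the semantic signal, and stacking adds value on top of tabular models.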

Strategy	Binary AUC	Multiclass Weighted F1	Notes
Embedding features (Amazon Titan)	0.982	–	Remains useful on tables
Direct prompted classification (Claude Sonnet 4)	0.500	0.018	Collapsed when semantic signal removed by hashing
Hybrid stacking (ML+LLM)	–	0.626	Best manually built multiclass model
Lag-based ML (monthly benchmark)	–	–	Outperformed time-series foundation models
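For readers less familiar with the two metrics in the table, the following sketch computes them from scratch. It uses the standard definitions (AUC as a pairwise ranking probability, weighted F1 as per-class F1 averaged by class support) and toy labels that are not from the study.

```python
from collections import Counter

def binary_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Toy data: a model that gives every row the same score cannot rank at all.
constant_auc = binary_auc([0, 1, 0, 1], [0.3, 0.3, 0.3, 0.3])
perfect_auc = binary_auc([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])

# Toy data: a model that always predicts one class scores poorly on the others.
one_class_f1 = weighted_f1(["a", "a", "b", "c"], ["a", "a", "a", "a"])
print(constant_auc, perfect_auc, round(one_class_f1, 3))
```

An AUC of 0.500 is what a model with no ranking ability produces, which puts the prompted classifier's binary result at chance level.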

On the monthly benchmark, lag-based machine learning outperformed time-series foundation models, though Chronos-small remained competitive in zero-shot forecasting.
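The article does not say which lags or which learner the study used, so the following is only a sketch of what "lag-based" features generally look like: each month's target is paired with values from earlier months. The lag set and the monthly counts are made up for illustration.

```python
def make_lag_rows(series, lags=(1, 2, 3, 12)):
    """Build (features, target) pairs where features are past values at the given lags."""
    max_lag = max(lags)
    rows = []
    # Start at the first month that has a full history for every lag.
    for t in range(max_lag, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows

# Hypothetical monthly retrofit counts (14 months), not data from the study.
monthly_counts = [12, 15, 11, 18, 20, 17, 14, 16, 19, 22, 21, 13, 14, 17]
rows = make_lag_rows(monthly_counts)
for features, target in rows:
    print(features, "->", target)
```

Rows like these can be passed to any tabular regressor, including the tree ensembles discussed above, which turns forecasting into an ordinary supervised learning problem.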

The paper's abstract sums up the takeaway: "the results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines."

Implications for Industrial AI

For enterprise technology buyers, the lessons are practical. When sensitive operational data has limited semantics, or has been hashed for privacy, LLMs prompted directly for classification can fail badly: the study reports a binary AUC of 0.500, which is chance level, and a weighted F1 of 0.018. Embeddings, by contrast, can preserve useful structure (AUC 0.982), and hybrid stacking produced the best manually built multiclass model (weighted F1 of 0.626). The study suggests that for industrial tabular datasets of this kind, classical machine learning, especially tree-based ensembles, still provides the most reliable results. LLMs are best deployed as feature extractors or in ensembles with traditional models, not as standalone replacements. This fits the broader view that for structured data without rich semantic context, traditional methods remain the default choice.

