Industrial retrofit planning relies on structured operational data, not free text. Planners must estimate whether a newly registered prototype will require a retrofit, which package it will need, and how long the work will take. A study on arXiv, titled "LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction," examines this challenge using a real-world industrial dataset. The researchers compared strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs.
The dataset links a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits). The tasks include binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark.
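To make "row-serialized inputs" concrete, the following is a minimal sketch of turning one table row into a text string, with optional hashing of sensitive values. The article does not describe the paper's actual serialization format or hashing scheme, so the column names, the `salt`, the `key: value` layout, and the helper names `hash_value` and `serialize_row` are all assumptions made for illustration.

```python
import hashlib


def hash_value(value, salt="demo-salt", length=8):
    """Replace a value with a short, stable token that carries no meaning.

    The salt and token length are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:length]


def serialize_row(row, hashed_columns=()):
    """Serialize a dict-like table row into a single 'key: value; ...' string."""
    parts = []
    for column, value in row.items():
        if column in hashed_columns:
            value = hash_value(value)
        parts.append(f"{column}: {value}")
    return "; ".join(parts)


# Hypothetical row; these columns are not taken from the paper's dataset.
row = {"model_line": "SUV-X", "plant": "Plant A", "mileage_km": 1200}

plain = serialize_row(row)
hashed = serialize_row(row, hashed_columns={"model_line", "plant"})

print(plain)
print(hashed)
```

In the hashed variant the categorical values become opaque tokens, so a language model reading the string can no longer draw on what the original names meant.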
LLM Strategies Compared
The study evaluated three LLM approaches:
- Embedding features using Amazon Titan
- Direct prompted classification using Claude Sonnet 4
- ML+LLM stacking (a hybrid approach)
These were pitted against classical tree ensembles and other tabular methods.
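The hybrid strategy can be pictured as a small meta-model trained on the outputs of the base models. The sketch below is one generic way to do stacking, not the paper's implementation: it is binary rather than 15-way for brevity, the probabilities are invented stand-ins for out-of-fold predictions from a tree ensemble and an LLM-derived model, and the meta-learner is a plain logistic regression fitted by gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on base-model outputs with batch gradient descent."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability of the positive class for one row of base-model outputs."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical out-of-fold probabilities per row: [tree ensemble, LLM-derived model].
oof_predictions = [
    [0.9, 0.6], [0.8, 0.7], [0.3, 0.4],
    [0.2, 0.1], [0.7, 0.2], [0.4, 0.8],
]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = fit_meta_learner(oof_predictions, labels)
stacked = [predict_proba(weights, bias, x) for x in oof_predictions]
```

In this toy data the tree-ensemble column separates the classes cleanly while the LLM-derived column is noisy, so the meta-learner ends up weighting the first input more heavily. Using out-of-fold rather than in-sample predictions is the usual way to keep the meta-model from simply rewarding base models that have memorized the training rows.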
Key Findings
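The results show a clear pattern: classical tree ensembles remain the strongest standalone models. The LLM results are nonetheless consistent across tasks: embeddings remain useful on tables, direct prompting collapses once hashing removes the semantic signal, and hybrid stacking yields the best manually built multiclass model.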
| Strategy | Binary AUC | Multiclass Weighted F1 | Notes |
|---|---|---|---|
| Embedding features (Amazon Titan) | 0.982 | – | Remains useful on tables |
| Direct prompted classification (Claude Sonnet 4) | 0.500 | 0.018 | Collapsed when semantic signal removed by hashing |
| Hybrid stacking (ML+LLM) | – | 0.626 | Best manually built multiclass model |
| Lag-based ML (monthly benchmark) | – | – | Outperformed time-series foundation models |
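For readers less familiar with the weighted F1 column, the sketch below computes it from scratch: per-class F1 scores averaged with weights proportional to each class's share of the true labels. The labels are invented for illustration and are not drawn from the study. The example shows how a classifier that collapses to a single class scores very low even though it gets every row of that class right.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Toy three-class problem (illustrative labels only).
y_true = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
collapsed = ["C"] * len(y_true)  # a model that always answers "C"

collapsed_score = weighted_f1(y_true, collapsed)
perfect_score = weighted_f1(y_true, y_true)
print(round(collapsed_score, 4), perfect_score)
```

Here classes A and B contribute an F1 of zero, and class C contributes an F1 of 4/12 weighted by its 20% share, which gives roughly 0.067 overall.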
On the monthly benchmark, lag-based machine learning outperformed time-series foundation models, though Chronos-small remained competitive in zero-shot forecasting.
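"Lag-based" here refers to turning a time series into a supervised table in which each row holds the values from previous months and the target is the current month. The following is a minimal sketch of that construction; the lag set `(1, 2, 3)` and the monthly counts are assumptions for illustration, and the article does not say which lags or models the study used.

```python
def make_lag_features(series, lags=(1, 2, 3)):
    """Build (features, targets) where each feature row holds earlier values.

    For time step t, the row is [series[t - lag] for lag in lags]
    and the target is series[t].
    """
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return rows, targets


# Hypothetical monthly retrofit-visit counts.
monthly_visits = [120, 135, 128, 150, 160, 155, 170]

X, y = make_lag_features(monthly_visits)
for features, target in zip(X, y):
    print(features, "->", target)
```

The resulting rows can be fed to any tabular regressor, which is what makes this framing a natural fit for the tree ensembles discussed above.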
According to the paper's abstract, "the results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines."
Implications for Industrial AI
For enterprise technology buyers, the insights are practical. When sensitive operational data has limited semantics, or has been hashed for privacy, LLMs used directly for classification can fail dramatically (weighted F1 of 0.018, with a binary AUC of 0.500, which is chance level). Embeddings, by contrast, can preserve useful structure (AUC 0.982), and hybrid stacking can improve multiclass predictions (weighted F1 of 0.626).
The study indicates that for industrial tabular datasets of this kind, classical machine learning, especially tree-based ensembles, still provides the most reliable standalone results. LLMs are best deployed as feature extractors or in an ensemble with traditional models, not as standalone replacements. This is consistent with the view that, for structured data without rich semantic context, traditional methods remain the default choice.