Using Large Language Models in Classification Tasks: A Case Study in Credit Risk Modeling-- My Failed Attempt #1

With the advancement of large language models (LLMs), I am curious about their implications for classification tasks in industry, where traditional models such as logistic regression and tree-based models currently dominate.

The data I used comes from Kaggle, specifically from a competition organised by Home Credit, a prominent player in the Southeast Asian market, particularly in the offline sector.

My initial approach is to convert tabular data into key-value pair text and let the large language model handle the rest. The first challenge I faced was memory usage: the Jupyter kernel kept crashing during the tokenisation step. The second was my limited knowledge of the transformer architecture, which left me relying on replicating examples found online. The third was slow iteration, since training an LLM is time- and resource-intensive. The fourth was that the model predicted all zeros.
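To make the serialisation step concrete, here is a minimal sketch of turning one tabular row into key-value text. The column names are examples of the kind of features in the Home Credit application table, not a prescription:

```python
import pandas as pd

def row_to_text(row: pd.Series) -> str:
    """Serialise one tabular row into 'key: value' text for an LLM."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())

# Illustrative columns in the style of the Home Credit application table.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"],
    "AMT_INCOME_TOTAL": [202500.0],
    "CNT_CHILDREN": [0],
})
print(row_to_text(df.iloc[0]))
# NAME_CONTRACT_TYPE: Cash loans, AMT_INCOME_TOTAL: 202500.0, CNT_CHILDREN: 0
```

Each serialised string then becomes the input text for tokenisation and fine-tuning.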

I solved the first challenge by processing the data in batches and saving the results to disk to avoid redundant work. The second will only improve with time and study. The third cannot be avoided, given my limited computational resources and knowledge of training efficiency.

Driven by curiosity, my questions are:

  • Using the same dataset, does a large language model (LLM) outperform traditional classification methods?
  • Does fine-tuning the baseline model on a specific dataset produce a usable expert model?
  • Will an LLM suffer from the curse of dimensionality, or can the attention mechanism handle it effectively?
  • Can an LLM help with feature engineering, the most important task in credit risk modelling?

I don’t think I can answer these questions through simple experimentation. However, it’s interesting to think about and explore these possibilities.

In the end, I managed to fine-tune the model, but it predicted all zeros. The exact reason remains unclear, though the imbalanced sample might be a contributing factor. To further my understanding and improve the results, I plan to delve into the existing literature on using large language models for classification tasks. I also discovered a technique called Parameter-Efficient Fine-Tuning (PEFT), which can significantly reduce the computational resources required.
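To give a sense of why PEFT saves resources, here is a NumPy sketch of the idea behind one popular method, LoRA: the pretrained weight matrix W stays frozen, and only two small low-rank factors A and B are trained, so the effective weight becomes W + AB. The dimensions below are illustrative, not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                        # hidden size, LoRA rank (r << d)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                 # trainable up-projection; zero init
                                     # means W + A @ B == W at the start

def adapted_forward(x):
    """Forward pass with the low-rank adapter; only A and B are trained."""
    return x @ W + (x @ A) @ B

full_params = W.size                 # 768 * 768 = 589,824
lora_params = A.size + B.size        # 2 * 768 * 8 = 12,288
print(f"trainable fraction: {lora_params / full_params:.3%}")
```

Training roughly 2% of the parameters per adapted matrix is what makes fine-tuning feasible on modest hardware, which is exactly the constraint I ran into.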