With the success of platforms like OpenAI's ChatGPT, large language models are at the forefront of everyone's mind. These sophisticated machine learning models are capable of a surprising amount of nuance in understanding and responding to conversational input. That sophistication isn't easy to come by, however.
A large language model (LLM) is a type of foundation model trained on a vast quantity of text data and used primarily for natural language processing (NLP). Essentially, LLMs and NLP together are what allow an artificial intelligence to understand and respond to human language. Other use cases for NLP include generating written content, answering questions, and translating text and code.
Acquiring sufficient LLM training data is no simple task. Even a basic large language model is typically trained on billions of words, and more sophisticated models pull text from a wide variety of sources. You might be tempted to acquire that training data by scraping content from the web; after all, that's how OpenAI trained GPT.
Resist that temptation. As OpenAI found out the hard way, content-scraping algorithms have no way of differentiating between publicly available information and private, proprietary, or copyrighted material. Data pulled directly from the Internet also tends to be of questionable quality. There's a reason ChatGPT appears so prone to hallucinations.
Instead, you'll want to look for a more official source of training data such as The Stack, which is available via Hugging Face, courtesy of the BigCode project. The Stack is a massive 2.7 terabyte library of source code in over 350 different programming languages, all licensed for use in machine learning training. Bear in mind, however, that this is only your starting point.
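To get a feel for the data, you can stream a slice of The Stack directly from Hugging Face rather than downloading all 2.7 terabytes up front. The snippet below is a minimal sketch assuming the `datasets` library and the `bigcode/the-stack` dataset ID; the Python subdirectory is just one possible choice of language subset.

```python
# A minimal sketch of streaming a slice of The Stack from Hugging Face.
# Assumes the `datasets` library is installed and you've accepted the
# dataset's terms on the Hub; the dataset ID and data_dir layout follow
# the BigCode project's documentation at the time of writing.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte dataset.
the_stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # restrict to a single language subset
    split="train",
    streaming=True,
)

for i, example in enumerate(the_stack):
    print(example["content"][:200])  # each record holds one source file's text
    if i == 2:
        break
```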
There are several other repositories you'll want to look through for datasets as well.
Once you've collected all your training data, you need to prepare it for use with your large language model. That includes removing stop words and junk data, tokenizing your text, and converting everything to lowercase. Depending on where you downloaded your datasets from, you might be fortunate enough that they've already been prepped.
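If you do have to prep the data yourself, those steps can start out as a few lines of Python. The sketch below uses a tiny, purely illustrative stop-word list; a real pipeline would pull a fuller list from a library such as NLTK or spaCy, and production LLM tokenizers typically work on subwords rather than whole words.

```python
import re

# An illustrative (not exhaustive) stop-word list; real pipelines usually
# draw a fuller list from a library such as NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "it"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on simple word boundaries, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())   # crude word-level tokenizer
    return [t for t in tokens if t not in STOP_WORDS]  # strip stop words

print(preprocess("The model is trained on a large corpus of text."))
# ['model', 'trained', 'large', 'corpus', 'text']
```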
Next, you'll want to choose your architecture and machine learning model. Generally, you're going to want to go with the transformer architecture used by models like BERT and GPT-4. Prior to training, you'll need to configure the model by specifying hyperparameters such as the number of attention heads, the loss function, and the number of layers in your transformer blocks.
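As a rough sketch of what that configuration step looks like, here is how you might specify those hyperparameters for a small GPT-style decoder using the Hugging Face `transformers` library. The sizes below are arbitrary illustrations rather than recommendations, and the cross-entropy loss is handled internally by the model class.

```python
# A sketch of specifying model hyperparameters with Hugging Face
# `transformers`, using a GPT-2-style decoder as a stand-in architecture.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_257,   # must match your tokenizer's vocabulary
    n_layer=12,          # number of transformer blocks
    n_head=12,           # attention heads per block
    n_embd=768,          # hidden (embedding) size
    n_positions=1024,    # maximum sequence length
)

model = GPT2LMHeadModel(config)  # cross-entropy loss is built in when labels are passed
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```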
Armed with pre-processed text data, your next step is to feed it to your language model. The model is presented with a sequence of words from your training data and attempts to predict the next word in the sequence. This happens millions of times over the course of training.
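In code, that next-word (next-token) prediction loop boils down to something like the following sketch. It assumes the `model` from the previous snippet and a hypothetical `batches` iterable that yields tensors of token IDs; both are placeholders, not a prescribed API.

```python
# A bare-bones sketch of the next-token-prediction training loop.
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=3e-4)
model.train()

for step, input_ids in enumerate(batches):   # `batches` is hypothetical: (batch, seq_len) token IDs
    # With labels=input_ids, the model shifts the sequence internally and
    # computes cross-entropy loss for predicting each next token.
    outputs = model(input_ids=input_ids, labels=input_ids)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```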
Note that training a large language model requires immense processing power. Rather than training the entire model on a single processor, you distribute the model across multiple GPUs, each of which handles its share of the computation in parallel. This has the added benefit of being generally much faster than single-processor training.
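One way to do that in practice is PyTorch's FullyShardedDataParallel (FSDP), which shards a model's parameters across GPUs. The following is a hedged sketch rather than a complete recipe: it assumes the `model` from earlier, a single-node job, and a launcher such as `torchrun` that supplies the rank and world-size environment variables.

```python
# A sketch of sharding a model across GPUs with PyTorch FSDP.
# Launch with a tool such as `torchrun`, which sets the distributed env vars.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")      # one process per GPU
torch.cuda.set_device(dist.get_rank())       # assumes a single node (rank == local GPU index)

model = FSDP(model.cuda())                   # parameters are sharded across the ranks
# ...then run the same training loop as before; FSDP gathers and frees
# parameter shards as each layer needs them.
```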
You should also be prepared to invest a great deal in your large language model; per OpenAI, it cost them $3.2 million to train ChatGPT.
Once your model is fully trained, the last step is to evaluate it against a test dataset it hasn't seen during training. Provided it passes this final assessment, you're ready to start using it. Otherwise, you may need to fine-tune some of its hyperparameters, provide additional training, or tweak its architecture.
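That final assessment is often summarized with a single number such as perplexity on the held-out set. Here's a rough sketch, assuming the trained `model` from earlier and a hypothetical `test_batches` iterable of token-ID tensors.

```python
# A sketch of evaluating average next-token loss and perplexity on a test set.
import math
import torch

model.eval()
total_loss, n_batches = 0.0, 0
with torch.no_grad():                        # no gradients needed for evaluation
    for input_ids in test_batches:           # `test_batches` is hypothetical
        outputs = model(input_ids=input_ids, labels=input_ids)
        total_loss += outputs.loss.item()
        n_batches += 1

avg_loss = total_loss / n_batches
print(f"test loss {avg_loss:.3f}, perplexity {math.exp(avg_loss):.1f}")
```

Lower perplexity on held-out text generally indicates a better-fit model; a large gap between training and test loss is a sign you need more data, more regularization, or one of the adjustments above.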