Data splitting is exactly what it sounds like: dividing a data set into multiple subsets. In the context of machine learning, one part of the split is typically used for training the model. The other parts are used to test or validate the model once it's been trained.
Machine learning models are typically developed and trained on a single large data set. However, the entirety of that data set is rarely used for training alone. While the majority is fed into the model, a sizable portion is held back at this stage. How that held-back data is used depends largely on which type of split you've chosen.
The two most common are the train-test split and the train-validation-test split. The only real difference between the two is that the latter reserves an additional subset for validation: the model is fitted on the training set, tuned against the validation set, and then given a final, unbiased evaluation on the test set.
It's generally accepted that the optimal ratio of training to testing to validation data is roughly 70-20-10: 70% of the data for training, 20% for testing, and 10% for validation.
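As a minimal sketch of what that looks like in practice, assuming scikit-learn is available, you can chain two train_test_split calls; the toy data set below is just a stand-in for your own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data set standing in for your real data: 1,000 samples, 20 features
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split: 70% for training, 30% held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, train_size=0.70, random_state=42
)

# Second split: divide the held-out 30% into 20% test and 10% validation
# (2/3 of the holdout is 20% of the full set; the remaining 1/3 is 10%)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, train_size=2/3, random_state=42
)

print(len(X_train), len(X_test), len(X_val))  # 700 200 100
```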
As for the method by which you split your data, there's no firm guideline or framework; it's largely a matter of preference. With that in mind, there are a few sampling methods you might use to ensure your training and testing data are split as equitably as possible, such as simple random sampling (drawing records purely at random) and stratified sampling (preserving each class's proportion in every subset; see the sketch below).
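As an illustration of stratified sampling, here's a hedged sketch assuming scikit-learn; the stratify parameter keeps each class's proportion intact in both subsets, which matters most when one class is rare:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data set: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)

# stratify=y preserves the 90/10 class ratio in both subsets, so the
# rare class doesn't end up concentrated in just one of the splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(Counter(y_train), Counter(y_test))
```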
Circling back to machine learning, data splitting helps ensure that your model performs the way it's supposed to. It also helps you avoid two of the most common problems in machine learning: overfitting and underfitting.
Overfitting occurs when a machine learning model fits its training data closely but is incapable of generalizing to new data; every prediction it makes is skewed by quirks of the training set. This can occur for a few different reasons, including a data set that's too small, a data set that contains so much noise the model memorizes the noise itself, or a data set too simplistic to reflect the variety of real-world inputs.
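A train-test split makes overfitting easy to spot: training accuracy far above test accuracy. Here's a small sketch, again assuming scikit-learn, where label noise (flip_y) invites an unconstrained decision tree to memorize the training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly flips 20% of the labels, adding noise to memorize
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# With no depth limit, the tree grows until it fits the training data exactly
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # ~1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```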
Underfitting, on the other hand, occurs when a machine learning model cannot give accurate results for either training data or test data. This typically happens when a model has not been trained for long enough, has not received enough data points, or is too simple to capture the underlying pattern. Fortunately, underfitting is a lot simpler to solve than overfitting; it's more an indication that your model is slightly undercooked than a sign of anything deeper.
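Underfitting shows up as poor accuracy on the training and test sets alike. A sketch under the same scikit-learn assumption, where a linear model is too simple for data arranged in concentric circles:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Concentric circles: no straight-line boundary can separate the classes
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# A linear classifier underfits here: it scores near chance on both splits
clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # ~0.5
print("test accuracy: ", clf.score(X_test, y_test))    # also ~0.5
```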
Data splitting provides you with validation samples that allow you to test for both underfitting and overfitting, which in turn gives you an opportunity to address both issues.
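To tie it together, here's one hedged sketch of how a validation set can steer the fix, assuming scikit-learn and an arbitrary range of tree depths: shallow trees underfit, unconstrained trees overfit, and the depth with the best validation score sits in between, leaving the test set untouched for the final check.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)

# 70-20-10 split: train on 70%, tune on the 10% validation set,
# and save the 20% test set for the final, unbiased evaluation
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.70, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_hold, y_hold, train_size=2/3, random_state=0)

# Sweep the tree depth and keep whichever scores best on validation data
best_depth, best_score = None, 0.0
for depth in range(1, 16):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen depth:", best_depth)
print("test accuracy:", final.score(X_test, y_test))
```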