Quality Control

Data quality


If you want to test the quality of your dataset, you can run our data quality tests. They will give you insight into the state of your dataset.

Document image


Grammatical Errors

Process of detecting and correcting erroneous words in a text.

A controlled level of grammatical errors can actually improve the model. If a model is trained on a dataset full of errors, the results will be poor; but if the training data contains no errors at all, the model will also fail when a user introduces a mistake. Maintaining a certain level of natural error is therefore important.

The process returns an average error value between 0 and 1 (minimum and maximum), and you can set a threshold anywhere between them.

"Detect and locate errors in the data provided. Running this test is recommended either when the data comes from sources with informal language (social media, for example) or when error detection is critical for the task to be performed (areas where accuracy is a risk factor)."
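As a rough illustration of such a score, the sketch below computes the fraction of words that fall outside a reference vocabulary. The toy vocabulary and the 0.2 threshold are assumptions for demonstration, not the platform's actual detector.

```python
# Illustrative error-rate score between 0 and 1, checked against a toy
# vocabulary. Real error detection is more sophisticated; this only
# demonstrates the shape of the result.

def error_rate(text: str, vocabulary: set[str]) -> float:
    """Fraction of words not found in the vocabulary (0 = clean, 1 = all wrong)."""
    words = text.lower().split()
    if not words:
        return 0.0
    errors = sum(1 for w in words if w not in vocabulary)
    return errors / len(words)

vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
score = error_rate("the quikc brown fox", vocab)  # "quikc" is misspelled
flagged = score > 0.2  # hypothetical threshold chosen by the user
```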

Document image


If you click on "Report Details" you will see the list of your tasks with the number of errors in each, and the incorrect words marked in red on the task.

Document image


Lexical Diversity

The set of distinct words present in the dataset compared to the total number of words.

Analyzes whether the text uses some words too often. It is recommended to take this into account either when the data comes from sources with informal language (social media, for example) or when error detection is critical for the task to be performed. The greater the lexical diversity, the richer the text the model is trained on, and the more robust it will be to new words in the future.

Very low lexical diversity can mean sentences with overly repeated words. On the other hand, diversity that is too high can mean the dataset uses too many synonyms, which can make it difficult to train a model.
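A minimal way to express this ratio of distinct words to total words is a type-token ratio. The exact metric the test uses is not documented here, so treat this as an illustration of the idea rather than the platform's formula.

```python
# Type-token ratio: distinct words over total words.
# 1.0 = every word unique; values near 0 = heavy repetition.

def type_token_ratio(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

low = type_token_ratio("good good good good")         # heavy repetition
high = type_token_ratio("every word here is unique")  # all words distinct
```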

Document image


If you click on "Report Details" you will see a list of each task with a lexical diversity percentage. The words marked in green are the most frequent, so if you want to increase diversity you could replace them with synonyms.

Document image




Unknown Words

Detects and locates unknown words in the provided data. It is recommended to always take this test into account, except for texts with content in several languages, since the unknown words detected do not necessarily imply a problem.
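In essence the test is a lookup against a reference word list. The sketch below uses a toy vocabulary as a stand-in for whatever word list the real test relies on.

```python
# Collect every word in a task that is absent from the reference vocabulary.

def unknown_words(text: str, vocabulary: set[str]) -> list[str]:
    return [w for w in text.lower().split() if w not in vocabulary]

vocab = {"please", "book", "a", "table", "for", "two"}
found = unknown_words("please book a tabel for two", vocab)  # "tabel" is unknown
```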

Document image


If you click on "Report Details" you will see a list of each task with its number of unknown words. The unknown words are marked in red.

Document image




Grammatical Similarity

Shows whether there are grammatically similar tasks, that is, near-identical sentences that may not benefit your model. This test returns a list of equivalent utterances with at least 95% similarity.
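One simple way to flag near-identical pairs is plain string similarity. The sketch below uses `difflib.SequenceMatcher` as a stand-in for whatever measure the test applies, keeping the 95% threshold mentioned above.

```python
# Flag pairs of tasks whose surface form is at least 95% similar.
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(tasks: list[str], threshold: float = 0.95):
    pairs = []
    for a, b in combinations(tasks, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs

tasks = ["Turn the lights on", "turn the lights on", "What's the weather?"]
duplicates = similar_pairs(tasks)  # first two tasks differ only in casing
```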

Document image


If you click on "Report Details" you will see a list of task pairs with the percentage of grammatical similarity between the two.

Document image


Semantic Similarity

This test looks at the semantics of the tasks. For instance, it can detect sentences that are grammatically different but similar in meaning, either through synonyms or because they are about the same topic.



Document image


If you click on "Report Details" you will see a list of task pairs with the percentage of semantic similarity between the two.

Document image


Language Detection

After you assign a language to your dataset, this test detects any other languages that may be present in it.
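As an illustration of the idea, the sketch below scores each task against tiny stopword lists and reports the share of words matching a language other than the assigned one. Both the stopword lists and the heuristic itself are assumptions for demonstration, not the detector the platform uses.

```python
# Toy language-mixing heuristic: share of words matching a foreign
# stopword list. The lists below are tiny illustrative samples.

STOPWORDS = {
    "en": {"the", "a", "is", "to", "and", "of"},
    "es": {"el", "la", "es", "de", "y", "un"},
}

def foreign_share(text: str, expected: str = "en") -> dict[str, float]:
    words = text.lower().split()
    return {
        lang: sum(w in sw for w in words) / len(words)
        for lang, sw in STOPWORDS.items()
        if lang != expected and words
    }

mix = foreign_share("la reserva is ready")  # some Spanish mixed into English
```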



Document image


If you click on "Report Details" you will see a list of tasks containing other languages and the percentage of the foreign language in each task.

Document image


Sentence Length

Process that detects the average length of the sentences within the dataset.

In translation tasks it is desirable for sentence lengths to be similar, since sentences of very different lengths can cause problems when training a model. For sentiment analysis, however, having different lengths can be beneficial, as the model sees different contexts in which it can learn more relations between words.
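Measuring both the average length and its spread makes the trade-off above easy to inspect; a large spread is what tends to matter for translation data. A stdlib-only sketch:

```python
# Average sentence length (in words) and its population standard deviation.
import statistics

def length_stats(sentences: list[str]) -> tuple[float, float]:
    lengths = [len(s.split()) for s in sentences]
    return statistics.mean(lengths), statistics.pstdev(lengths)

mean, spread = length_stats([
    "Turn on the lights",
    "Play some jazz",
    "Set an alarm for seven in the morning please",
])
# lengths are 4, 3 and 9 words, so the spread is large relative to the mean
```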

Document image




Term Frequency

This is a technique to quantify words in a set of documents. 

We compute a score for each word to signify its importance in the document and corpus. This is done by comparing the frequency of a word in a task with the use of that word across all tasks. If a word has a high frequency in a task but also a high frequency across the whole dataset, it may not be important (words such as "the", "a", "to", etc.). On the other hand, if a word with a high frequency in a task has a low frequency across the dataset, it is likely relevant.

The chart shows the most important or representative words of the dataset.
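The comparison described above is essentially TF-IDF weighting: a word's frequency within one task, discounted by how many tasks contain it. A minimal sketch, assuming a simple logarithmic discount (the platform's exact weighting may differ):

```python
# TF-IDF per word for one task: (count / task length) * log(tasks / tasks containing word)
import math
from collections import Counter

def tf_idf(task: str, all_tasks: list[str]) -> dict[str, float]:
    words = task.lower().split()
    tf = Counter(words)
    n = len(all_tasks)
    scores = {}
    for word, count in tf.items():
        df = sum(word in t.lower().split() for t in all_tasks)
        scores[word] = (count / len(words)) * math.log(n / df)
    return scores

tasks = ["the flight to Madrid", "the hotel in Madrid", "the weather today"]
scores = tf_idf(tasks[0], tasks)
# "the" appears in every task, so its score collapses to zero;
# "flight" appears only here, so it scores highest.
```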



Document image


If you click on "Report Details" you will have a more extensive list of words and their importance in the dataset.

Document image


Zipf's Law

Detects possible anomalies in the frequency distribution of words.

Zipf’s Law states that a small number of words are used many times, while the vast majority are used very rarely. The process detects if there is a possible anomaly in the frequency distribution of words, which could indicate that the text has been artificially generated, or other types of errors.

Document image


If you click on "Report Details" you get the term frequency of your dataset in two charts. In the top one, the blue curve is the real frequency, and the red dashed line represents the optimal frequency, Prob(r) = freq(r) / N.

The bottom chart lists the most frequent words and their frequencies.
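The comparison drawn in the charts can be sketched as follows: the observed share of each word, freq(r) / N, set against an idealized Zipf share proportional to 1 / r for rank r. The normalization of the ideal curve over the observed ranks is an assumption for illustration.

```python
# For each word rank r: observed share freq(r)/N vs. an ideal Zipf
# share proportional to 1/r (normalized so the ideal shares sum to 1).
from collections import Counter

def zipf_comparison(text: str):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    ranked = counts.most_common()
    harmonic = sum(1 / r for r in range(1, len(ranked) + 1))
    rows = []
    for r, (word, freq) in enumerate(ranked, start=1):
        observed = freq / total        # the blue curve
        expected = (1 / r) / harmonic  # the red dashed Zipf line
        rows.append((r, word, observed, expected))
    return rows

rows = zipf_comparison("a a a b b c")
# rank 1 is "a" with observed share 3/6 = 0.5
```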

Document image










Updated 25 Mar 2024