Quality Control

Data quality


If you want to test the quality of your dataset, you can run our data quality tests. They will give you insight into the state of your dataset.

Document image


Grammatical Errors

Process of detecting and correcting erroneous words in a text.

A controlled level of grammatical errors can actually improve the model. If a model is trained on a dataset full of errors, the results will be poor; but if the training data contains no errors at all, the model will also fail when a user introduces a mistake. Maintaining a certain level of natural error is therefore important.

The process returns an average error value between 0 and 1 (minimum and maximum), and you can set a threshold anywhere between them.

"Detect and locate errors in the data provided. Running this test is recommended either when the data comes from sources with informal language (social media, for example) or when error detection is critical for the task to be performed (areas where accuracy is a risk factor)."
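As a rough illustration of such a score, the sketch below computes the fraction of words that fall outside a reference vocabulary. The toy vocabulary and the 0.2 threshold are assumptions for demonstration, not the platform's actual detector.

```python
# Illustrative error-rate score between 0 and 1, checked against a toy
# vocabulary. Real error detection is more sophisticated; this only
# demonstrates the shape of the result.

def error_rate(text: str, vocabulary: set[str]) -> float:
    """Fraction of words not found in the vocabulary (0 = clean, 1 = all wrong)."""
    words = text.lower().split()
    if not words:
        return 0.0
    errors = sum(1 for w in words if w not in vocabulary)
    return errors / len(words)

vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
score = error_rate("the quikc brown fox", vocab)  # "quikc" is misspelled
flagged = score > 0.2  # hypothetical threshold chosen by the user
```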

Document image


If you click on "Report Details" you will see the list of your tasks with the number of errors in each, and the incorrect words marked in red on the task.

Document image


Lexical Diversity

The set of distinct words present in the dataset compared to the total number of words.

Analyzes whether the text uses some words too often. It is recommended to take this into account either when the data comes from sources with informal language (social media, for example) or when error detection is critical for the task to be performed. The greater the lexical diversity, the richer the text the model is trained on, and the more robust it will be to new words in the future.

Very low lexical diversity can mean sentences with overly repeated words. On the other hand, diversity that is too high can mean the dataset uses too many synonyms, which can make it difficult to train a model.
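A minimal way to express this ratio of distinct words to total words is a type-token ratio. The exact metric the test uses is not documented here, so treat this as an illustration of the idea rather than the platform's formula.

```python
# Type-token ratio: distinct words over total words.
# 1.0 = every word unique; values near 0 = heavy repetition.

def type_token_ratio(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

low = type_token_ratio("good good good good")         # heavy repetition
high = type_token_ratio("every word here is unique")  # all words distinct
```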

Document image


If you click on "Report Details" you will see a list of each task with a lexical diversity percentage. The words marked in green are the most frequent, so if you want to increase diversity you could replace them with synonyms.

Document image




Unknown Words

Detects and locates unknown words in the provided data. It is recommended to always take this test into account, except for texts with content in several languages, since the unknown words detected do not necessarily imply a problem.
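In essence the test is a lookup against a reference word list. The sketch below uses a toy vocabulary as a stand-in for whatever word list the real test relies on.

```python
# Collect every word in a task that is absent from the reference vocabulary.

def unknown_words(text: str, vocabulary: set[str]) -> list[str]:
    return [w for w in text.lower().split() if w not in vocabulary]

vocab = {"please", "book", "a", "table", "for", "two"}
found = unknown_words("please book a tabel for two", vocab)  # "tabel" is unknown
```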

Document image


If you click on "Report Details" you will see a list of each task with its number of unknown words. The unknown words are marked in red.

Document image




Grammatical Similarity

Shows whether there are grammatically similar tasks, that is, near-identical sentences that may not benefit your model. This test returns a list of equivalent utterances with at least 95% similarity.
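One simple way to flag near-identical pairs is plain string similarity. The sketch below uses `difflib.SequenceMatcher` as a stand-in for whatever measure the test applies, keeping the 95% threshold mentioned above.

```python
# Flag pairs of tasks whose surface form is at least 95% similar.
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(tasks: list[str], threshold: float = 0.95):
    pairs = []
    for a, b in combinations(tasks, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs

tasks = ["Turn the lights on", "turn the lights on", "What's the weather?"]
duplicates = similar_pairs(tasks)  # first two tasks differ only in casing
```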

Document image


If you click on "Report Details" you will see a list of task pairs with the percentage of grammatical similarity between the two.

Document image


Semantic Similarity

This test looks at the semantics of the tasks. For instance, it can detect sentences that are grammatically different but similar in meaning, either through synonyms or because they are about the same topic.



Document image


If you click on "Report Details" you will see a list of task pairs with the percentage of semantic similarity between the two.

Document image


Language Detection

After you assign a language to your dataset, this test detects any other languages that may be present in it.
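As an illustration of the idea, the sketch below scores each task against tiny stopword lists and reports the share of words matching a language other than the assigned one. Both the stopword lists and the heuristic itself are assumptions for demonstration, not the detector the platform uses.

```python
# Toy language-mixing heuristic: share of words matching a foreign
# stopword list. The lists below are tiny illustrative samples.

STOPWORDS = {
    "en": {"the", "a", "is", "to", "and", "of"},
    "es": {"el", "la", "es", "de", "y", "un"},
}

def foreign_share(text: str, expected: str = "en") -> dict[str, float]:
    words = text.lower().split()
    return {
        lang: sum(w in sw for w in words) / len(words)
        for lang, sw in STOPWORDS.items()
        if lang != expected and words
    }

mix = foreign_share("la reserva is ready")  # some Spanish mixed into English
```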



Document image


If you click on "Report Details" you will see a list of tasks containing other languages and the percentage of the foreign language in each task.

Document image


Sentence Length

Process that detects the average length of the sentences within the dataset.

In translation tasks it is desirable for sentence lengths to be similar, since sentences of very different lengths can cause problems when training a model. For sentiment analysis, however, having different lengths can be beneficial, as the model sees different contexts in which it can learn more relations between words.
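Measuring both the average length and its spread makes the trade-off above easy to inspect; a large spread is what tends to matter for translation data. A stdlib-only sketch:

```python
# Average sentence length (in words) and its population standard deviation.
import statistics

def length_stats(sentences: list[str]) -> tuple[float, float]:
    lengths = [len(s.split()) for s in sentences]
    return statistics.mean(lengths), statistics.pstdev(lengths)

mean, spread = length_stats([
    "Turn on the lights",
    "Play some jazz",
    "Set an alarm for seven in the morning please",
])
# lengths are 4, 3 and 9 words, so the spread is large relative to the mean
```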

Document image




Term Frequency

This is a technique to quantify words in a set of documents. 

We compute a score for each word to signify its importance in the document and corpus. This is done by comparing the frequency of a word in a task with the use of that word across all tasks. If a word has a high frequency in a task but also a high frequency across the whole dataset, it may not be important (words such as "the", "a", "to", etc.). On the other hand, if a word with a high frequency in a task has a low frequency across the dataset, it is likely relevant.

The chart shows the most important or representative words of the dataset.
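The comparison described above is essentially TF-IDF weighting: a word's frequency within one task, discounted by how many tasks contain it. A minimal sketch, assuming a simple logarithmic discount (the platform's exact weighting may differ):

```python
# TF-IDF per word for one task: (count / task length) * log(tasks / tasks containing word)
import math
from collections import Counter

def tf_idf(task: str, all_tasks: list[str]) -> dict[str, float]:
    words = task.lower().split()
    tf = Counter(words)
    n = len(all_tasks)
    scores = {}
    for word, count in tf.items():
        df = sum(word in t.lower().split() for t in all_tasks)
        scores[word] = (count / len(words)) * math.log(n / df)
    return scores

tasks = ["the flight to Madrid", "the hotel in Madrid", "the weather today"]
scores = tf_idf(tasks[0], tasks)
# "the" appears in every task, so its score collapses to zero;
# "flight" appears only here, so it scores highest.
```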



Document image


If you click on "Report Details" you will have a more extensive list of words and their importance in the dataset.

Document image


Zipf's Law

Detects possible anomalies in the frequency distribution of words.

Zipf’s Law states that a small number of words are used many times, while the vast majority are used very rarely. The process detects if there is a possible anomaly in the frequency distribution of words, which could indicate that the text has been artificially generated, or other types of errors.

Document image


If you click on "Report Details" you get the term frequency of your dataset in two charts. In the top one, the blue curve is the real frequency, and the red dashed line represents the optimal frequency, Prob(r) = freq(r) / N.

The bottom chart lists the most frequent words and their frequencies.
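The comparison drawn in the charts can be sketched as follows: the observed share of each word, freq(r) / N, set against an idealized Zipf share proportional to 1 / r for rank r. The normalization of the ideal curve over the observed ranks is an assumption for illustration.

```python
# For each word rank r: observed share freq(r)/N vs. an ideal Zipf
# share proportional to 1/r (normalized so the ideal shares sum to 1).
from collections import Counter

def zipf_comparison(text: str):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    ranked = counts.most_common()
    harmonic = sum(1 / r for r in range(1, len(ranked) + 1))
    rows = []
    for r, (word, freq) in enumerate(ranked, start=1):
        observed = freq / total        # the blue curve
        expected = (1 / r) / harmonic  # the red dashed Zipf line
        rows.append((r, word, observed, expected))
    return rows

rows = zipf_comparison("a a a b b c")
# rank 1 is "a" with observed share 3/6 = 0.5
```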

Document image










Updated 25 Mar 2024