Data Quality Issues and Assurance :-
Duplicate data - Modern organizations face an onslaught of data from all directions: local databases, cloud data lakes, and streaming data. They may also have application and system silos. These sources are bound to contain a lot of duplication and overlap. Duplicated contact details, for example, significantly affect customer experience: marketing campaigns suffer when some prospects are missed while others are contacted again and again. Duplicate data also increases the probability of skewed analytical results, and as training data it can produce skewed ML models.
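As an illustration of the duplicate-contact problem, here is a minimal sketch that removes repeated contacts by comparing normalized email addresses. The field names (`name`, `email`), the sample records, and the choice of email as the matching key are all assumptions made for this example; real deduplication usually needs fuzzier matching across several fields.

```python
def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so trivially different spellings match."""
    return email.strip().lower()


def deduplicate_contacts(contacts):
    """Keep the first record seen for each normalized email address."""
    seen = {}
    for contact in contacts:
        key = normalize_email(contact["email"])
        if key not in seen:
            seen[key] = contact
    return list(seen.values())


# Hypothetical sample data: the first two rows describe the same person.
contacts = [
    {"name": "Asha Rao", "email": "asha.rao@example.com"},
    {"name": "Asha Rao", "email": " Asha.Rao@Example.com "},
    {"name": "Ben Li", "email": "ben.li@example.com"},
]

unique_contacts = deduplicate_contacts(contacts)
print(f"{len(contacts)} records in, {len(unique_contacts)} records out")
```

Keeping the first occurrence is only one possible policy; merging fields from all duplicates into a single record is another common choice.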
Inaccurate data - Data accuracy plays a critical role in highly regulated industries such as healthcare. The experience of COVID-19 made the need for better data quality, for this and subsequent pandemics, more evident than ever. Inaccurate data does not give you a correct real-world picture and cannot help you plan an appropriate response. If your customer data is not accurate, personalized customer experiences disappoint and marketing campaigns underperform.
Ambiguous data - In large databases or data lakes, some errors can creep in even under strict supervision. The problem becomes more overwhelming when data streams in at high speed. Column headings can be misleading, formatting can be inconsistent, and spelling errors can go undetected. Such ambiguous data can introduce multiple flaws in reporting and analytics.
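Formatting and spelling problems of this kind can often be caught with simple rule-based checks. The sketch below is one hedged example: the column names (`signup_date`, `country`), the expected date format, and the list of allowed country values are all hypothetical and would come from your own data dictionary.

```python
from datetime import datetime

# Assumed reference list; in practice this would come from a data dictionary.
ALLOWED_COUNTRIES = {"India", "Germany", "Brazil"}


def find_issues(row):
    """Return a list of human-readable problems found in one row."""
    issues = []
    try:
        # Assumed convention: dates are stored as YYYY-MM-DD.
        datetime.strptime(row["signup_date"], "%Y-%m-%d")
    except ValueError:
        issues.append("signup_date is not in YYYY-MM-DD format")
    if row["country"] not in ALLOWED_COUNTRIES:
        issues.append(f"unknown country value: {row['country']!r}")
    return issues


# Hypothetical rows: the second has a misformatted date and a misspelling.
rows = [
    {"signup_date": "2023-04-01", "country": "India"},
    {"signup_date": "01/04/2023", "country": "Germny"},
]

report = {index: find_issues(row) for index, row in enumerate(rows)}
for index, issues in report.items():
    print(index, issues)
```

Checks like these flag suspicious rows for review rather than correcting them automatically, since the right fix for an ambiguous value is rarely obvious from the data alone.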
Hidden data - Most organizations use only a part of their data, while the rest may be lost in data silos or dumped in data graveyards. For example, customer data held by sales may not be shared with the customer service team, so an opportunity to build more accurate and complete customer profiles is lost. Hidden data means missing out on opportunities to improve services, design innovative products, and optimize processes.
Inconsistent data - When you work with multiple data sources, you are likely to find mismatches in the same information across sources. The discrepancies may be in formats, units, or sometimes spellings. Inconsistent data can also be introduced during migrations or company mergers. If not reconciled continually, inconsistencies tend to build up and destroy the value of the data. Data-driven organizations keep a close watch on data consistency because they want only trusted data powering their analytics.
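Unit mismatches are one of the easier inconsistencies to reconcile in code. The following sketch converts weights reported in different units to kilograms before they are compared. The source names, the sample values, and the set of supported units are assumptions for illustration only.

```python
# Conversion factors to kilograms for the units assumed in this example.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kilograms(value, unit):
    """Convert a weight to kilograms, tolerating case and stray whitespace."""
    key = unit.strip().lower()
    if key not in TO_KG:
        # Fail loudly instead of silently passing through an unknown unit.
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * TO_KG[key]


# Hypothetical records from three sources that each use a different unit.
records = [
    ("warehouse_a", 2.0, "kg"),
    ("warehouse_b", 2000, "G"),
    ("warehouse_c", 10, " lb "),
]

normalized = {source: to_kilograms(value, unit) for source, value, unit in records}
for source, kilograms in normalized.items():
    print(f"{source}: {kilograms:.3f} kg")
```

Raising an error on an unrecognized unit is a deliberate choice here: a rejected record is easier to trace than a value that was silently left in the wrong unit.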
Too much data - Given the focus on data-driven analytics and its benefits, too much data may not seem to be a data quality issue. But it is. When you are looking for data relevant to your analytical projects, it is possible to get lost in too much data. Business users, data analysts, and data scientists are often reported to spend as much as 80% of their time locating the right data and preparing it. Other data quality issues also become more severe as the volume of data grows, especially with streaming data and large files or databases.