There are two key schools of thought on good practice for database management: data normalization and standardization. We will learn why does each matter?
Organizations are investing heavily in technology as artificial intelligence techniques, such as machine learning, continue to gain traction across several industries.
- A Price Water Cooper Survey pointed out that 40% of business executives in 2018 make major decisions at least once every 30 days using data and this is constantly increasing
- A Gartner study states the 40% of enterprise data is either incomplete, inaccurate, or unavailable
As the speed of data coming into the business increases with the Internet of Things starting to become more mature, the risk of disconnected and siloed data grows if it is poorly managed within the organization. Gartner has suggested that a lack of data quality control costs average businesses up to $14 million per year.
The adage of “garbage in, garbage out” still plagues analytics and decision making and it is fundamental that businesses realize the importance of clean and normalized data before embarking on any such data-driven projects.
When most people talk about organizing data, they think it means getting rid of duplicates from their system which, although important, is only the first step in quality control and there are more advanced methods to truly optimize and streamline your data.
There are two key schools of thought on good practice: data normalization and standardization. Both have their place in data governance and/or preparation strategy.
Why data normalization?
A data normalization strategy takes database management and organizes it into specific tables and columns with the purpose of reducing duplication, avoiding data modification issues, and simplifying queries. All information is stored logically in one central location which reduces the propensity for inconsistent data (sometimes known as a “single source of truth”). In simple terms, it ensures your data looks and reads the same across all records.
In the context of machine learning and data science, it takes the values from the database and where they are numeric columns, changes them into a common scale. For example, imagine you have a table with two columns, and one contains values between 0 and 1 and the other contains values between 10,000 and 100,000.
The huge differences in scale might cause problems if you attempt to do any analytics or modeling. This strategy will take these two columns by creating a matching scale across all columns whilst maintaining the distribution e.g. 10,000 might become 0 and 100,000 becomes 1 with values in-between being weighted proportionality.
In real-world terms, consider a dataset of credit card information that has two variables, one for the number of credit cards and the second for income. Using these attributes, you might want to create a cluster and find similar applicants.
Both of these variables will be on completely different types of scale (income being much higher) and would therefore likely have a far greater influence on any results or analytics. Normalization removes the risk of this kind of bias.
The main benefits of this strategy in analytical terms are that it allows faster searching and sorting as it is better at creating indexes via smaller, logical tables. Also, in having more tables, there is a better use of segments to control the tangible placement of data store.
There will be fewer nulls and redundant data after modeling any necessary columns and bias/issues with anomalies are greatly reduced by removing the differences in scale.
This concept should not be confused with data standardization, and it is important that both are considered within any strategy.
What is data standardization?
Data standardization takes disparate datasets and puts them on the same scale to allow easy comparison between different types of variables. It uses the average (mean) and the standard deviation of a dataset to achieve a standardized value of a column.
For example, let’s say a store sells $520 worth of chocolate in a day. We know that on average, the store sells $420 per day and has a standard deviation of $50. To standardize the $520 we would do a calculation as follows:
520-420/50 = 100/50 = 2 – our standardized value for this day is 2. If the sales were $600, we’d scale in a similar way as 600-420/50 = 180/50 = 3.6.
If all columns are done on a similar basis, we quickly hav1`e a great base for analytics that is consistent and allows us to quickly spot correlations.
In summary, data normalization processes ensure that our data is structured logically and scaled proportionally where required, generally on a scale of 0 to 1. It tends to be used where you have predefined assumptions of your model. Data standardization can be used where you are dealing with multiple variables together and need to find correlations and trends via a weighted ratio.
By ensuring you have normalized data, the likelihood of success in your machine learning and data science projects vastly improves. It is vital that organizations invest as much in ensuring the quality of their data as they do in the analytical and scientific models that are created by it. Preparation is everything in a successful data strategy and that’s what we mainly teach in our data science bootcamp courses.