Data Strategy Drives Better Quality, Cost Savings

“If you allow people to enter data manually and you don't have any checks before it gets stored, you're going to see some really crazy stuff.”

©metamorworks/iStock/Getty Images Plus

Technologies and techniques such as robotics, machine learning, natural language processing, and blockchain present a defining opportunity for the finance function. However, companies that have not adequately invested in data will find their digital transformation efforts frustrated. 

The Financial Education & Research Foundation is working on a report with EY to help senior-level financial executives better understand the data strategy and quality needed for their digital transformation. 

As part of that effort, we speak with Professor Dave Waterman, an adjunct lecturer at Georgetown University and a data scientist at U.Group.

A transcript of the discussion appears below the podcast player.

FERF: What are some of the ways that low-quality data impacts an organization?

Dave Waterman: The issue with low-quality data is that it's going to take more effort to make it useful, and so there's going to be an increased time and increased cost at every step along the way. To load that data, to analyze it, to process it, to use it for making decisions, all of those things are going to be more difficult and more complicated as the data quality degrades.

As a simple example, suppose your neighbor's kid comes over and asks you to help out with her lemonade stand by looking over her financials. If she comes to you with an Excel spreadsheet, you think, ‘Oh, this is great. She has her business in order. I can just look it over, tell her where she's doing well, and where she needs to improve.’ But when you start taking a closer look, you see that maybe ten percent of the fields are missing from the spreadsheet, or some of the column names don't make sense. Maybe they're abbreviated and you can't figure out what they are. Or the data's not consistent: some of the fields have numbers in them, some of them have words in them. All of those things are going to make it more difficult for you to figure out what the data is telling you.

That’s an example of the kinds of data that we see every day. It's not uncommon for me to see a data set where ten percent of the fields are missing. That doesn't mean the data is useless. It just means that I have to do a lot more work to impute what could be there: What does that missing data mean, and how can I work around it or work with it? 
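To make this concrete, here is a minimal sketch of quantifying and imputing missing fields with pandas; the lemonade-stand column names and values are hypothetical, invented for illustration:

```python
import pandas as pd

# Hypothetical lemonade-stand ledger with some fields missing
# (columns and values are illustrative, not from the interview).
df = pd.DataFrame({
    "day":   ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "cups":  [12, None, 9, 15, None],
    "price": [0.50, 0.50, None, 0.50, 0.50],
})

# First quantify how much is missing before deciding how to handle it.
missing_share = df.isna().sum().sum() / df.size  # 3 of 15 fields missing

# One simple imputation strategy: fill numeric gaps with the column median.
filled = df.fillna({"cups": df["cups"].median(), "price": df["price"].median()})
```

Median fill is only one of many imputation choices; the right one depends on what the missing values actually mean, which is exactly the conversation Waterman describes having with the data's owners.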

All of those things are going to slow down the work that you do. They're probably going to require a lot more conversations among the parties involved. If that's a difficult communication to make, then that's going to slow you down even more. All of those things are going to result in less accuracy in your results, less clarity in your decisions, less detail in your conclusions, and that's going to lead to missed opportunities.

FERF: You mentioned that you often see data that's incomplete or messy. What are some of the reasons why data can be of poor quality?

Waterman: The most common reason would be that it was input by a person and no validation was done when that data was entered. If you allow people to enter data manually and you don't have any checks before it gets stored, you're going to see some really crazy stuff. 

You can have data validation that lets stuff through, and that can be a problem too. But, really, you need to architect your data structures from the very beginning with a knowledge and understanding of what's going to go into those fields, so that you can set clear limits that make it much more difficult for individuals to enter poor data. 
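The "clear limits before storage" idea can be sketched in a few lines of plain Python; the field names and ranges below are hypothetical examples, not a real schema from the interview:

```python
# A minimal sketch of validating manually entered records before they are
# stored. Field names, allowed currencies, and limits are all hypothetical.
ALLOWED = {
    "amount":   lambda v: isinstance(v, (int, float)) and 0 <= v <= 1_000_000,
    "currency": lambda v: v in {"USD", "EUR", "GBP"},
    "memo":     lambda v: isinstance(v, str) and len(v) <= 200,
}

def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record may be stored."""
    errors = ["missing field: %s" % k for k in ALLOWED if k not in record]
    errors += ["bad value for %s" % k for k, check in ALLOWED.items()
               if k in record and not check(record[k])]
    return errors
```

A record is only written when `validate` returns an empty list, so "really crazy stuff" is rejected at entry time rather than discovered during analysis.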

And you can still have data quality issues from data that comes automatically. Think about if you have a computer server somewhere that's running and it's making records over time. Maybe the network connection goes down, and so you lose a day's worth of data from that server. Those things happen too. Sometimes it's unavoidable. But the biggest problem that we see generally comes from when an individual has the ability to enter data on their own.

FERF: That's interesting. A lot of what we've heard in our interviews thus far is that manual data, while it's really messy to input and you have to have a lot of controls around it, can also be some of the most compelling data. 

Waterman: Oh, absolutely. The closer that you are to the person who's going to be using the data and making use of the data, or the closer you are to the person who works with that data and understands it, the more informative it's going to be for you. Being able to have a direct contact with the people who are working daily with the data is going to give you a much stronger understanding of what the data really means and represents, so being able to shorten that feedback loop between the person doing the analysis and the person who's gathering that data is always great.

FERF: What are the key things to consider when formulating a data strategy?

Waterman: Some big ideas to consider are that the best time to start collecting data is always right now because, in the future, you're going to want to have as much of it as possible. You're going to want it to go back in time as far as possible. If you think that there's a possibility that you could make use of a certain set of data in the future, the best time to start getting that data is now. 

As you are collecting that data, you'll start seeing things, and that will inform what you choose to do moving forward. Maybe you decide that you need to change your data collection strategy. Maybe you need to change your data storage strategy. Or maybe you just discover that the data isn't going to give you the answers you thought it would. But you don't know any of those things unless you start collecting it and start seeing it. Our recommendation is always to start collecting the data as soon as you can.

Along with that, the earlier in the data pipeline that you can clean the data, fix errors in it, and validate it, the easier and cheaper it's going to be for you to use that data later on. Like we were talking about before, we can always go back and try to clean the data up after the fact, but it's always cheaper and faster to do it as early in the process as possible.

Following that idea, data gains value over time. The further back in time your data goes, the more informative and the richer it will be, so if you can collect it correctly the first time around, you're going to be able to continue using it over and over as you collect more and more data. So, from a data scientist's perspective, those are the things that I'm most concerned about.

I would also think about it from a data architect's perspective, which means asking: how much data are we getting, where is it coming from, where is it going, and what's the most efficient way to access it? The whole point of collecting the data is that we want to answer questions with it. So knowing the types of questions you're going to be asked from the very beginning will help you decide how to construct your storage solution and how to construct the interface that analysts and the other people accessing the data are going to use.

FERF: External data is currently an underutilized asset for a lot of organizations. What do they need to know when looking at incorporating more external data?

Waterman: My first thought would be about the complexities of data integration. One problem that I run into all the time is that when you have data sources collected by different organizations or by different people for different reasons, the way that that data is structured is going to be different and being able to unify those disparate data sets is going to be a challenge. 

If we're talking about time series data, maybe the way that the dates and the times are recorded in the two data sets will be different. Or if we're talking about comparing company data, maybe the way that the names of the companies are represented will be different. So I think it's important to keep in mind that any time you're incorporating an external data set you don't control, there's going to be some additional cost in making that data sync with the data that you have on your own.
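The date-format mismatch Waterman mentions can be illustrated with a small normalization step; the two formats and the sample dates below are hypothetical:

```python
from datetime import datetime

# Hypothetical: an internal and an external source record the same
# dates in different formats (ISO vs. US month/day/year).
internal = ["2023-01-05", "2023-01-06"]
external = ["01/05/2023", "01/06/2023"]

def normalize(value: str) -> datetime:
    """Try each known format until one parses; fail loudly otherwise."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % value)

# After normalization, records from both sources can be merged and deduplicated.
merged = sorted({normalize(d) for d in internal + external})
```

Failing loudly on an unknown format, rather than guessing, is the safer default: a silently misparsed date (is 01/05 January 5 or May 1?) is exactly the kind of integration cost being described.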

FERF: Shifting gears, data was really a part of the finance function at the beginning, but then it was shuffled into its own corner. But there's a lot of benefit to having finance employees that are fluent in data. How can companies best train their existing finance employees to have that data quality fluency?

Waterman: I am not a finance person myself, but I would imagine that the people in those positions probably already have their own intuitive understanding of data quality. If you are working with that data daily, then you already know the challenges and problems that go along with it. From my perspective, I wouldn't necessarily be worried about training the finance employees to recognize data quality so much as helping them communicate their understanding of the data to the rest of the team. The technical side of the team might have different ways of explaining the problems that they're having with data quality than the finance people would have, but they're all experiencing the same challenges. 

I would imagine that the bigger focus should be on bringing that team closer together and having them work in a more tightly coupled way, so that there is stronger communication from both sides and they can understand each other's perspectives better.

FERF: How does the relationship between people, technology, and process impact data quality management?

Waterman: Teamwork and office relationships are important to everything in an organization, and their importance in data quality really is not any different. Analysts who feel empowered to make decisions and to communicate their findings are going to be more effective within the organization than people who feel that they don't have the power to speak up when they see things that they think need to be changed or improved. 

Creating an organization where the people who are working directly with the data, the people at the bottom of the food chain, have the ability to communicate up and let those above them understand the challenges and strengths that they are experiencing will allow the entire organization to grow. 

It's similar to the themes that we've been talking about with these other topics where the people who are working with the data are going to be able to give you the best insight about it, and opening the channel of communication between them and the rest of the organization is going to improve everything.

FERF: How do companies establish a high-quality data culture?

Waterman: One of the important things to keep in mind is that data quality is something that has a cost, like most things in business. If you want to improve it, you have to be willing to pay for it. You are going to be changing processes, or you might need to be changing some technical support that you have. Whatever it is that you're going to be doing to create this high-quality data culture is going to require an investment from the company. There's a financial component to that investment, and there's also a professional component to that investment, which is similar to what we were just talking about where you want to be able to empower those people within your organization to be able to make the important decisions that are going to give you the opportunities to grow your data practice.

FERF: What are best practices in driving data transformation?

Waterman: With data transformation, you're talking about large-scale projects. We really always want to come back to thinking about the questions that we are trying to answer, or depending on the circumstances, the user problems that we're trying to solve with our data. You have to keep those things in mind from the very beginning because those will really inform the decisions you make about your data.

You can't just say, ‘We need to modernize our organization. We'll get the same technology that I know all of the other companies around us are using, and we'll just implement them. Everything will be up to date, and we'll be fine.’ You really have to think about your individual use case. Once you have this data, what are you going to do with it? What questions are you going to try to answer with it? That will help you drive your decision-making processes about how you decide to store it, how you decide to access it, who's going to have access to it. Do you need controls for privacy? All of those things will come out of your use case.

FERF: That's excellent. Thank you Professor Waterman.

Waterman: Great, thank you very much.