Top three mistakes new data scientists need to avoid
Data science as a discipline has a brief history but a bright future. The label for this particular field of specialization has caught on in the past ten years after being popularized by technologists at LinkedIn and Facebook. Today, the field is attracting thousands of professionals who are fascinated with the idea of extracting insights from data. Because of the potential for innovation and competitive advantage, it’s no surprise that businesses are eager to find reliable ways to derive value from Big Data. So far, the results of data science projects have been mixed—mostly because it’s easy to get things wrong. Kristen Kehrer, Chief Data Scientist at Constant Contact, offered her perspective on current best practices and what may be possible in the future.
Data scientists must be educators within the enterprise
Like most data science professionals, Kehrer worked cross-functionally, and providing analytics for different departments gave her an understanding of many areas of the business. But expecting other departments to grasp what data science was all about proved unrealistic. This specialized profession required explanation. Kristen found herself faced time and again with the opportunity to educate a non-technical audience about the capabilities of data science and how things really worked. “People not in the field have to be introduced to testing best practices and how different models work. Certain assumptions need to be met for different models to be interpreted correctly, and best approaches for different data types all need to be taken into consideration along with the business problem that you’re trying to solve.”
Fortunately, the conversation about best practices begins with something that business decision makers can readily understand: practical value. It should be the primary factor in determining whether any data science project gets the green light. “The most important thing to consider at the beginning of the project is that the outcome needs to be actionable. I think many organizations have had the experience where a model is built and then people say, ‘That’s great, but then what do we do with it?’ I won’t work on anything unless I know that I’m going to be able to either have it productionalized or have it automatically upload data to a database that other people can leverage to build targeting campaigns, or for some other usage. It needs to push the needle for our customers.”
Avoid three common mistakes in modeling and testing
Data science seems complex, and it can be. However, the basic principles are simple to grasp once they’ve been explained. Kristen mentioned several areas where she often finds herself educating a business audience about best practices. Here are her top three tips when it comes to testing.
- Don’t muddy the waters. Change one variable at a time. “You have to make sure that your test is going to learn what you’re hoping to learn, be clear about this upfront, and structure a test in a way that analysts will be able to read out on it. People like to change two things in the test, and then we can’t read on which item actually made a difference. We only have aggregate results.”
- Oversimplification has its own hazards. The time-honored A/B split test is one example. Being able to combine more factors and see how they compare in a variety of combinations can yield greater insight. “There’s a benefit in full factorial test design and how we can learn about interactions, and we’re not losing information in doing iterative A/B testing. If you do an MVT (multivariate test), you get to keep all that information. If you do an A/B and then another A/B, the loser of the first might have been the winner if it had been in combination with something that was in the second test.” Multivariate testing is not appropriate in every scenario, but where it fits, it offers a greater opportunity to learn about interactions.
- Teasing out the details requires a different approach to modeling—especially when the data is being sliced up in specific ways. “When you have a model, how do you test to see if it’s working? That involves some uplift modeling. You can’t just take the lowest deciles of the model and run a test and analyze it the normal way you’d analyze a typical A/B test where people are randomized—because you’re unlikely to see an uplift. It’s already been determined that these people are less likely to take certain actions. How do you really get at that incremental lift?”
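The point about iterative A/B tests losing information can be made concrete with a small sketch. The example below is illustrative only: the two factors (an email subject line and a button style), their labels, and the conversion rates are all hypothetical numbers chosen to show an interaction, not figures from Kehrer or Constant Contact. It compares what two back-to-back A/B tests would pick against what a full factorial (2x2) test would pick.

```python
from itertools import product

# Hypothetical true conversion rates for each (subject_line, button) cell.
# These numbers are invented purely to illustrate an interaction effect.
RATES = {
    ("A", "X"): 0.10,
    ("B", "X"): 0.12,
    ("A", "Y"): 0.16,
    ("B", "Y"): 0.11,
}

SUBJECTS = ["A", "B"]
BUTTONS = ["X", "Y"]


def sequential_ab(rates, default_button="X"):
    """Run two A/B tests in a row, carrying only the first winner forward."""
    # Test 1: compare subject lines while the button stays at its default.
    subject = max(SUBJECTS, key=lambda s: rates[(s, default_button)])
    # Test 2: compare buttons, but only alongside the winning subject line.
    button = max(BUTTONS, key=lambda b: rates[(subject, b)])
    return subject, button


def full_factorial(rates):
    """Compare every combination of both factors at once."""
    return max(product(SUBJECTS, BUTTONS), key=lambda cell: rates[cell])


def interaction(rates):
    """Difference of differences: effect of B over A under Y minus under X."""
    effect_under_x = rates[("B", "X")] - rates[("A", "X")]
    effect_under_y = rates[("B", "Y")] - rates[("A", "Y")]
    return effect_under_y - effect_under_x


if __name__ == "__main__":
    print("Sequential A/B picks:", sequential_ab(RATES))
    print("Full factorial picks:", full_factorial(RATES))
    print("Interaction term:", round(interaction(RATES), 4))
```

With these made-up rates, subject line A loses the first test, so the sequential approach never tries it with button Y, which is the best-performing combination overall. A nonzero interaction term is the signal that the effect of one factor depends on the level of the other. A real test would estimate these rates from randomized samples and check statistical significance rather than reading them from a table.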
Data science is going to get easier—while it remains exciting
As data science continues to grow and expand across organizations of all sizes, what does Kehrer expect to see occurring in the next five years? “I think there will be huge strides in automated machine learning. It’s going to allow us to learn new things and act on them much faster. The amount of time it takes now to get a data set together, to feature engineer, build a model and deploy it, that’s all going to be much faster. We will be able to learn new things, focus on new areas, and get deep into personalization.”
Speed has already improved dramatically during Kristen’s own career, with the availability of tools like DataExplorer in R for exploratory data analysis and Featuretools in Python for feature engineering. “There are now libraries that allow you to do automated feature engineering. There are libraries out there that will fit a number of models and choose the best one to go with as the final model. You still need to understand the context, but these packages save a lot of time. I think utilizing these types of tools is going to become the norm and the end-to-end pipeline is going to get quicker. A lot of the things that were manual and time-intensive are getting (and are going to continue to become) easier.” To give an idea of the speed of change, Kehrer pointed out that the majority of libraries she currently uses have only been out for a few months.
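The idea of a library that fits a number of models and chooses the best one can be sketched in a few lines. This is a toy illustration in plain Python and does not reflect the API of any particular automated machine learning package: the two candidate models, the helper names, and the data are all assumptions made for the example. Each candidate is fit on training data, scored on held-out data, and the one with the lowest error is kept.

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Second candidate: simple least-squares line through the training data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_best(candidates, train, holdout):
    """Fit every candidate on train, score on holdout, return the best name."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *holdout)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data: a linear trend with a small alternating offset.
xs = list(range(10))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]
train = (xs[0::2], ys[0::2])    # even-indexed points for fitting
holdout = (xs[1::2], ys[1::2])  # odd-indexed points for scoring

best, scores = select_best({"mean": fit_mean, "line": fit_line}, train, holdout)

if __name__ == "__main__":
    print("Best model:", best)
    print("Holdout errors:", scores)
```

Real automated tools search far larger model families and typically use cross-validation rather than a single split, but the loop is the same shape. As the quote notes, the selection step does not replace understanding the business context behind the data.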
That’s one of the things that makes data science such an attractive career for any technologist with an inquisitive mind and a thirst for novelty. “It doesn’t matter if you’ve been doing this job for two years or ten years, it’s a field of continuous learning. Everything is changing fast and it’s very exciting.” The frontiers of data science are beckoning. Who will answer the call?