Confessions of a data scientist
“I didn’t check the literature”
Let’s face it, we’ve all done it: we’ve spent hours trying to address a data problem only to find out that someone else had already solved it. I think this is a forgivable offence: data science is a vast area of knowledge, and even the most thorough of literature reviews can miss important details. Using data right is a journey.
What’s also true, and more important, is that data science is a multidisciplinary endeavour. This very nature makes it really hard, sometimes, to have enough experience in all its facets. What matters for a data scientist is their ability to learn the right language to engage, collaborate, and work together with other disciplines.
“I didn’t understand the customer expectations”
No piece of work should ever be started without first asking that most classic of “GDS” questions: “what’s the user need?”. And I’d extend that question to also ask “and what’s the user capability?” This is because, even if we focus on addressing a user problem, they might not have the right skills to be able to exploit the solution. I see this on a daily basis in my current role leading the AI Skunkworks in the NHS: digital and data maturity varies across the system.
Data scientists can have a tendency to forget that their models are only as good as the user’s understanding of how they work; the user must be able to relate them back to the problem they’re experiencing, for example by being able to measure how the data-driven solution impacts on it.
I like football metaphors — think about Total Football — data science is a discipline in which the scientist needs to learn to play other roles, like community engagement and user research, or they won’t be able to fully understand the context of the problem and the ability of their solution to really address it.
“I do data science on spreadsheets”
There’s nothing wrong with that, to be frank. What’s wrong is using any technology blindly, which can sometimes lead to harmful outcomes. This doesn’t happen because spreadsheets are intrinsically bad; it happens because we’re using them either in ways they were not intended to be used, or without understanding their limitations and capabilities.
Rather than speaking about spreadsheets as evil and R pipelines as good, I’d move the conversation to be about quality control. Any data process requires a strong understanding of how to check its quality: how to check that the input data is correct, or that mistyped records are not creating unexpected results. Sometimes, a spreadsheet is all we need to check the quality of data, and often we don’t check it well enough because we’re lazy.
Instead of getting into wars of religion about which platform is best, focus on the outcome and set out rules that allow problems in the data and its analysis to be picked up. Of course, moving on from spreadsheets to code is often a good move if it is well understood; but moving alone won’t improve the quality of your analysis if you don’t think quality first. And if you’re doing this for others, resist the temptation to just disrupt — disruption only works well if you hold hands with the people whose work you’re disrupting and bring them along the journey with you.
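To make “quality first” concrete, here is a minimal sketch of the kind of input checks I mean — missing values, type errors, and implausible ranges. The column names and thresholds are hypothetical examples, not a standard; the point is that these checks can live anywhere, in a spreadsheet formula or in code.

```python
def check_rows(rows):
    """Return a list of (row_index, problem) pairs for suspect records."""
    problems = []
    for i, row in enumerate(rows):
        # Missing values: empty fields are a common source of silent errors.
        if any(value in ("", None) for value in row.values()):
            problems.append((i, "missing value"))
        # Type check: 'age' (a hypothetical column) should parse as an integer.
        age = row.get("age", "")
        if age not in ("", None):
            try:
                age = int(age)
            except ValueError:
                problems.append((i, "age not an integer"))
            else:
                # Range check: catch mistyped records (e.g. 250 instead of 25).
                if not 0 <= age <= 120:
                    problems.append((i, "age out of range"))
    return problems

rows = [
    {"name": "Ada", "age": "36"},
    {"name": "Bob", "age": "250"},   # likely a typo
    {"name": "", "age": "41"},       # missing name
]
print(check_rows(rows))
```

None of this is sophisticated, and that is the point: the discipline of writing the checks down matters more than the platform they run on.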
“The implementation is difficult and not interesting”
Prototyping is exciting and gets to a solution very quickly. Data scientists love prototyping their models. You might ask why I use the word “prototyping”, and that’s because if models need to become part of a larger application — which is what the vast majority of data science models are intended to do — those models will remain prototypes until they’re fully tested in the real environment. And I’ve seen way too many data scientists thinking “deployment is not my problem”. It is.
This issue links up to what I was saying earlier: think about which needs you’re trying to address and understand that the right model is a model that can be used. Understanding the infrastructure, liaising early with your IT department (via your data engineers, if your team is large enough to have both data scientists and engineers), and discussing the capabilities available is key to the success of every model. And, last but not least, ask the key question of every data science application: “who’s going to maintain the model and avoid model decay?”. It is your problem, as there’s no such thing as “handing over” models to production teams without the continuing oversight of a data scientist.
“I used some bad science”
Let’s face it, this stuff is difficult and, as I said earlier, the areas of knowledge that intersect with data science are vast. In order to succeed, a data scientist needs to be both an expert and a generalist, which is rather demanding to say the least. The only thing I want to stress here is that things go wrong — it happens even to the best scientists. Keep an open mind, review your steps and your models, and keep communicating openly about them. Don’t forget the most fundamental lesson: often, data doesn’t offer answers, it offers the ability to find questions.
“I’ve recruited people with the wrong skill set”
The reality of data science is that sometimes you’ll need an expert in a very specific type of technology, e.g. NLP, or vision AI, or GANs, but most of the time you’ll need a generalist who’s able to think about problems and find the best matching technology or method. Sadly, it’s hard to cover all bases, especially in the public sector (for a variety of reasons; financials, to begin with, but not exclusively). The reality of who it’s possible to hire might clash with the immediate needs of the organisation or face competition from the incredibly hyped market out there.
The key to success, if you’re a manager, is to be honest with yourself, your team, and your superiors about what is achievable and what the route to delivery is; it is to keep your team motivated and willing to explore the unknown while feeling safe in their position; it is to see the reality of daily data science as a process to navigate, allowing teams to learn and develop as new needs emerge — because they will emerge, and very frequently, as data is an immature and changing topic. Pragmatically, when recruiting, I tend to manage expectations by avoiding overly narrow job titles; for example, we once decided to do without a “data engineer” and opted for the broader title of “data technologist”. This turned out to be one of my best hires — someone with broad expertise and a willingness to quickly learn new tech and put it to good use.
Perfect teams don’t exist. Your job as a manager is that of a tuner: making sure that, over time, the team broadly matches and delivers according to the business needs, and making it evolve as much as possible so that the match keeps working. There’s excitement in delivery and your job as a manager is to keep your team excited and able to deliver.
“I no longer believe in data science”
Hey, this is me! :-) No, I’m kidding. The thing is, there should be no “belief” involved in any of this. Data science is an instrument, a set of tools to address a set of problems. Don’t ever make the mistake of elevating a set of tools to something it isn’t, on a philosophical level. All sets of tools evolve to stay suitable to the work they’re meant to deliver, and sometimes you find a new set of tools that delivers better. Job titles change, roles evolve, and fashions often determine what’s sexy. Disciplines merge, clash, and change as a consequence — what we call “digital” today is fundamentally the meeting (not always on equal footing) of good old customer service, communications, and I.T. Stay away from wars of religion, don’t carry your baggage around if it’s too heavy, and learn something from other disciplines — it will give you a broader angle of view. And don’t forget that there never were silver bullets, and there still aren’t.