Making peace between IT and data science teams
Increasingly, government agencies are forming data science teams, either as decentralized groups spread across the organization or as a single centralized team. These smaller, scrappier groups are typically born out of a program or functional group outside the IT organization, which can create confusion and frustration. When data scientists go to the IT department for guidance, resources or permission, IT staff often see them as amateurs who don’t understand the way things really work. Just as often, the data science team regards the IT organization as either behind the times or too focused on process to move quickly.
A few clarifications can help IT personnel better understand and collaborate with a burgeoning data science team.
Data science might not mean “big data,” at least not at first.
IT staff often think about data science within the context of “big data” -- machine learning and predictive analytics on a huge scale -- and when they talk to any data science team, they expect a request for extensive storage and computational capacity. However, many data science teams (especially in the public sector) aren’t working at that scale yet. Often they are working with tens of thousands -- not millions -- of data points.
IT managers should start by understanding the scale at which a data science team works before recommending an infrastructure or technology. Not every data science team will need to work at a Spark or Hadoop scale. A simpler ecosystem may be just fine for their needs and might get them started faster.
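One rough way to start that conversation is a back-of-the-envelope estimate of how much memory a dataset would occupy. The sketch below is illustrative only: the function names, the 8 bytes per value and the rule of keeping the data under a quarter of available RAM are assumptions made for the example, not figures from any particular team or tool.

```python
def estimate_size_mb(rows, columns, bytes_per_value=8):
    """Approximate in-memory size of a numeric table, in megabytes.

    Assumes every value takes bytes_per_value bytes (8 is typical for
    a 64-bit number); text fields can take considerably more.
    """
    return rows * columns * bytes_per_value / 1_000_000


def fits_on_one_machine(rows, columns, ram_gb, max_fraction=0.25):
    """Return True if the table stays within max_fraction of RAM.

    The one-quarter threshold is an assumed rule of thumb that leaves
    room for copies made during analysis.
    """
    budget_mb = ram_gb * 1000 * max_fraction
    return estimate_size_mb(rows, columns) <= budget_mb


if __name__ == "__main__":
    # Hypothetical workload: 50,000 records with 40 fields each.
    print(estimate_size_mb(50_000, 40))                # 16.0
    print(fits_on_one_machine(50_000, 40, ram_gb=16))  # True
```

Under these assumptions, a dataset of tens of thousands of records comes to a few megabytes, which is well within reach of an ordinary laptop.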
Open source languages really are best for data science.
Two of the most commonly used open-source languages in data science are R and Python. The draw isn’t that they are free, although that’s certainly a bonus. Both of these languages have specific libraries already developed just for data science tasks. Over time, programmers add libraries to the language, so that it continues to evolve and improve, saving data scientists time and allowing them to benefit from the work of other industry experts.
Often, when a data science team asks for access to open-source programming languages, they are met with skepticism and concern and encouraged to invest in a “real” proprietary tool (think SAS or SPSS). The idea is that proprietary tools are more robust, more professional and have a longer track record, especially in government. However, for data scientists, R and Python are the industry standard and have the most cutting-edge features.
Data scientists can’t provide a list of fields or variables they want from a database. They want it all.
Often, data scientists ask IT for data, and the IT team responds with a request form for particular tables and/or variables from a database. Almost always, the data science team will respond with, “I want it all.” Data scientists building models often don’t know in advance which data will matter for their task, and in fact, the research question often flows from the data that is available, not the other way around.
Of course, it’s important to enforce best practices and share sensitive data only with those who need access, but the data scientists’ perspective explains why they err on the side of wanting more data, not less.
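One possible middle ground, sketched here purely as an illustration, is to hand over every column by default and withhold only the fields flagged as sensitive, rather than asking the data science team to list the fields it wants. The field names, the sample record and the function name below are hypothetical; a real list of restricted fields would come from the agency’s own data policy.

```python
import csv
import io

# Hypothetical field names; a real list would come from agency policy.
SENSITIVE_FIELDS = {"ssn", "date_of_birth"}


def share_all_but_sensitive(source, destination, sensitive=SENSITIVE_FIELDS):
    """Copy every column from source to destination except sensitive ones.

    source and destination are open text files (or file-like objects)
    holding CSV data. Returns the list of column names that were kept.
    """
    reader = csv.DictReader(source)
    kept = [name for name in reader.fieldnames if name not in sensitive]
    writer = csv.DictWriter(
        destination,
        fieldnames=kept,
        extrasaction="ignore",  # silently skip the withheld columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


if __name__ == "__main__":
    # Made-up sample record for demonstration only.
    raw = io.StringIO(
        "case_id,ssn,zip_code,benefit_amount\n"
        "1,123-45-6789,20001,250\n"
    )
    out = io.StringIO()
    print(share_all_but_sensitive(raw, out))
    print(out.getvalue())
```

The design choice is to exclude by exception instead of include by request, so the data science team sees everything it is permitted to see without having to guess which fields might turn out to matter.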
Data science’s role in the public sector is growing, and establishing a strong partnership between IT and data science teams is critical for the success of both. By understanding a few basic data science concepts, IT teams can better support their colleagues and limit the headaches along the way.
Amy Deora leads the public sector practice at Civis Analytics.