Big data's big question: Where are the data scientists?
- By Rutrell Yasin
- Mar 08, 2012
EDITOR'S NOTE: This article was updated Friday, March 9, 2012, to correct the spelling of Jason Yee's last name.
Even as organizations try to define the role of those tasked with analyzing and managing the new phenomenon of big data, the people capable of doing that job are already projected to be in short supply.
The move from a network-centric to a data-rich environment requires a different skill set, John Marshall, CTO of the Directorate of Intelligence J2 with the Joint Chiefs of Staff, said March 6 during a forum on big data.
There is a need to aggregate data across structured and unstructured repositories. The biggest threat to productivity is the lack of aggregation across different lines of business, said Marshall, who moderated a panel on the emerging role of the chief data officer and data scientist at the Government Big Data Forum. The forum was held in Washington, D.C., by systems integrator Carahsoft.
A recent study estimated that by 2018 the shortage of qualified workers who understand the power of big data will be between 140,000 and 190,000 people, Marshall said.
Big data consists of datasets so large that conventional database management systems cannot handle them efficiently.
The Army might not be dealing with the large datasets of some IT organizations, but it is dealing with complex data, said Russell Richardson, a chief cloud architect for the Army in support of the Deputy Chief of Staff G2, which is responsible for intelligence activities for the Army. Richardson is also a senior vice president with Sotera Defense Solutions.
In his capacity with Army G2, Richardson said big data means making small data very, very large. "So the data scientist, to my mind, is responsible for managing data and [determining] how to make it useful to analysts." That entails making decisions on what data is enriched or indexed, he said.
"If we take all of the finished intelligence data from the last 50 years, it would only amount to about 500 to 600 gigabytes of data," Richardson said. "That by no means is big data." But when analysts are finished indexing that data, it takes up petabytes.
What makes the data so large? The Army is trying to create a completely pre-correlated dataset, which causes the datasets to expand. That is all right as long as analysts can get to the data quickly, Richardson said.
The data scientist's task here would be to expand and enrich the data so that users trying to get value from it can have it within a response time acceptable for their missions. This is a field industry and government haven't really trained people for, he said. Right now the Army, for instance, is trying to be clever by figuring out how to reduce storage and heuristically index datasets.
"Data scientist" is a broad term, akin to "doctor," which describes a general practitioner. Seeing a neurologist, however, is different from seeing a cardiologist, said Michael Lazar, a senior solutions architect with VMware who previously worked in the intelligence community.
There is a need for multidisciplinary specialists who can deal with the range of data types: structured data, free-text data, video and imagery, Lazar said. People will specialize in subsets; for instance, you might need an expert who can do a better job on video extraction, he said.
“It is not about data; it is about what is the meaningful information for that line of business,” Lazar said. The line of business could be gathering intelligence for the warfighter or the manager running an agency who has to figure out the most cost-effective programs to eliminate because of budget cuts.
New cadres of students are coming out of college with data analytics and data mining skills, Lazar said, but the public sector should consider growing people internally to take on the tasks of analyzing and managing data. Organizations that train people from within tend to keep them longer, he said.
"I think that the concept of data scientist is too broad right now," said Matt Schumpert, director of solutions engineering at Datameer Inc. "I don't think you need the word 'scientist' in the title to use big data." The word is a barrier to entry into data analytics, especially when organizations are so dependent on data.
There are the true scientists doing predictive modeling and data mining who need tools, he said. Datameer is working with others in the industry to promote the Predictive Model Markup Language (PMML), a common format that allows models to be exchanged between different modeling tools.
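PMML is an XML-based standard, so a model exported by one tool can be read and evaluated by another that was never designed to train it. The sketch below is purely illustrative, not a complete or schema-valid PMML document: it shows a simplified regression-model fragment and uses only Python's standard library to read the fragment and score an input, which is the interchange idea in miniature. The field names and coefficients are invented for the example.

```python
import xml.etree.ElementTree as ET

# Minimal, PMML-style fragment for a linear regression model.
# Real PMML documents carry an XML namespace, a full data dictionary and
# a schema version; this sketch keeps only enough structure to show
# how a consuming tool could evaluate a model it did not train.
PMML_SNIPPET = """
<PMML version="4.1">
  <DataDictionary>
    <DataField name="sqft" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="50000">
      <NumericPredictor name="sqft" coefficient="120.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def predict(pmml_text: str, inputs: dict) -> float:
    """Evaluate the regression table: intercept + sum(coefficient * input)."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//RegressionTable")
    result = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        result += float(pred.get("coefficient")) * inputs[pred.get("name")]
    return result

print(predict(PMML_SNIPPET, {"sqft": 1000.0}))  # 50000 + 120.5 * 1000 = 170500.0
```

Because the model travels as plain XML rather than as tool-specific binary state, the producing and consuming applications only need to agree on the format, which is exactly the barrier PMML is meant to remove.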
Schumpert said the role of the data officer is becoming more important in data-driven companies. The data officer's goal is to treat data as a corporate asset, avoid data silos, and ensure analysts work from original data rather than data already prepared by applications or services.
A text mining company such as Digital Reasoning needs data scientists, said Jason Yee, a mission support engineer with the company. A data scientist needs to be a subject-matter expert, a mathematician and an application developer; it takes programming skills to do things in an automated way, Yee said. People with that type of talent can be found in Silicon Valley or New York and are not likely to want to work in the public sector, he said.
The technology industry can break down the barriers that hamper organizations from managing big data by developing better application programming interfaces that make it easier for people to access and analyze data, Yee said.
Rutrell Yasin is senior editor for GCN covering cloud computing.