Scientists facing limitations of IT infrastructure

6 challenges to the future of IT infrastructure

Big data will continue to present a major challenge for scientific research in the years to come, according to a white paper prepared by CERN openlab, a public-private partnership between the European Organization for Nuclear Research, known as CERN, and IT companies. A number of European laboratories and researchers from the Human Brain Project also contributed to the paper.

The partners defined six major challenges covering the most crucial needs of IT infrastructures: data acquisition, computing platforms, data storage architectures, compute management and provisioning, networks and connectivity, and data analytics.

The report also broke the scientific communities’ big data challenges into several categories: collecting and analyzing the data to support scientific discoveries; developing cost-effective and secure computer infrastructures for handling large amounts of data; performing accurate simulations; and sharing data across thousands of scientists and engineers.

These emerging issues require a new skill set for scientists and engineers. “It is vital that new generations of scientists and engineers are formed with adequate skills and expertise in modern parallel programming, statistical methods, data analysis, efficient resource utilization and a broader understanding of the possible connections across seemingly separate knowledge fields,” noted the report.

The report presents a number of use cases in different scientific and technological fields for each of the six challenge areas.

1. Data acquisition

Researchers working with ever larger data sets need access to high-performance computing resources and a means of collaborating with dispersed scientific teams. However, firewalls that protect email, Web browsing and other applications can cause packet loss in TCP/IP networks, dramatically slowing data speeds to the point of making online collaboration unviable. Routers and switches without enough high-speed memory to handle large bursts in traffic can cause the same problems.
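To see why even a small amount of packet loss matters so much, consider a rough sketch based on the widely cited Mathis approximation for steady-state TCP throughput. The formula and the sample numbers below (a 1,460-byte segment size and a 50-millisecond round-trip time) are illustrative assumptions, not figures taken from the white paper.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate ceiling on TCP throughput, in bits per second.

    Uses the Mathis approximation:
        rate <= (MSS / RTT) * sqrt(3/2) / sqrt(p)
    where p is the packet loss rate. This is a back-of-the-envelope
    model, not a measurement of any real network.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


if __name__ == "__main__":
    # Hypothetical long-distance link: 1,460-byte segments, 50 ms round trip.
    for loss in (1e-6, 1e-4, 1e-2):
        mbps = mathis_throughput_bps(1460, 0.05, loss) / 1e6
        print(f"loss rate {loss:.0e}: roughly {mbps:.1f} Mbit/s at best")
```

Under this model, throughput falls with the square root of the loss rate, so a hundredfold increase in loss cuts the achievable rate by a factor of 10, regardless of how fast the underlying link is.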

Scientific research will require more sophisticated and flexible means to collect, filter and store data via high-speed networks. The authors expect that future computing systems will need to be rapidly reconfigurable, either to take into account changes in theories and algorithms or to exploit idle cycles. Additionally, costs and complexity must be reduced by replacing custom electronics with high-performance commodity processors and efficient software.

2. Computing platforms

The massive amount of space and energy required to house and power supercomputers has been a limiting factor in growing processing power. Throughput can only be increased nowadays by exploiting multi-core platforms or new general-purpose graphics processors, the report stated, but existing software must be optimized or even redesigned to do that.

To address this issue, Sandia National Laboratories announced a project to develop new types of supercomputers with faster computing speeds, lower cost and lower energy needs. Technologies being explored include nano-based computing, quantum computing and intelligent computing (computers that learn on their own).

“We think that by combining capabilities in microelectronics and computer architecture, Sandia can help initiate the jump to the next technology curve sooner and with less risk,” said Rob Leland, head of Sandia’s Computing Research Center. The project, Beyond Moore Computing, addresses the plateauing of Moore’s Law, which threatens to make future computers impractical due to their enormous energy consumption.

3. Data storage architectures

Today, most physics data is still stored with custom solutions. Cloud storage architectures, such as Amazon Simple Storage Service (S3), however, may provide scalable and potentially more cost-effective alternatives, the authors noted.

The scientific community needs flexibility beyond space and cost in cloud storage options, so it can optimize storage architecture to the application. Likewise, it needs archival and long-term storage solutions.

Reliable, efficient and cost-effective data storage architectures must be designed to accommodate a variety of applications and different needs of the user community.

4. Compute management and provisioning

High-performance computing will require automation and virtualization to manage growing amounts of data, without involving proportionately more people. At the same time, the authors said, access to resources within and across different scientific infrastructures must be made secure and transparent to foster collaboration.

One way the scientific community is addressing this is through distributed systems, which divide a problem into many tasks, each of which is solved by one or more computers that communicate with each other. Grid computing, a type of distributed computing, supports computations across multiple administrative domains and involves virtualization of computing resources.

In the United States, the Open Science Grid (OSG), jointly funded by the Department of Energy and the National Science Foundation, is being used as a high-throughput grid for solving scientific problems by breaking them down into a large number of individual jobs that can run independently. In one example, OSG is being used to plan for a new high-energy electron-ion collider at Brookhaven National Laboratory.
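The high-throughput pattern described above can be sketched in a few lines. In this minimal illustration, a local thread pool stands in for grid worker nodes, and a Monte Carlo estimate of pi stands in for a real scientific workload. The function names and job sizes are hypothetical, and this is not the OSG's own submission interface.

```python
import random
from concurrent.futures import Executor, ThreadPoolExecutor


def run_job(seed: int, samples: int) -> int:
    """One independent job: count random points inside the unit quarter circle."""
    rng = random.Random(seed)  # each job has its own seed, so jobs share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_jobs: int, samples_per_job: int, executor: Executor) -> float:
    """Split the problem into independent jobs, run them, then merge the results."""
    seeds = range(num_jobs)
    sizes = [samples_per_job] * num_jobs
    hits = executor.map(run_job, seeds, sizes)
    return 4.0 * sum(hits) / (num_jobs * samples_per_job)


if __name__ == "__main__":
    # A local pool is only a stand-in; on a grid, each job would go to a scheduler.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(estimate_pi(num_jobs=8, samples_per_job=20_000, executor=pool))
```

Because no job depends on another, the jobs can run in any order and on any machine, and only the small per-job results need to be collected at the end. That independence is what makes this kind of workload a good fit for a grid.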

5. Networks and connectivity

Good, reliable networking is crucial to scientific research. Optimization of data transfer requires new software-based approaches to network architecture design. The ability to migrate a public IP address, for example, would allow application services to be moved to other hardware. And adding intelligence to both wired and Wi-Fi networks could help them optimize traffic delivery to improve service and contain costs.

6. Data analytics

Finally, as data becomes too vast and diverse for humans to understand at a glance, there must be new ways to separate signal from noise and find emerging patterns, so as to continue making scientific discoveries, the authors said.
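As a small illustration of separating signal from noise, the sketch below flags unusual readings with a modified z-score built on the median and the median absolute deviation. The technique, the 3.5 threshold and the sample readings are assumptions chosen for illustration; the white paper does not prescribe a particular method.

```python
from statistics import median


def find_outliers(values: list[float], threshold: float = 3.5) -> list[float]:
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less distorted by extreme values
    than a score based on the mean and standard deviation.
    """
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values sit at the median, so no spread can be estimated.
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical sensor readings with one obvious anomaly.
    readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 25.0]
    print(find_outliers(readings))
```

A production analytics platform would apply far more sophisticated models at much larger scale, but the principle is the same: define what normal looks like, then surface whatever departs from it.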

Data analytics as a service would consist of near-real-time processing, batch processing and integration of data repositories. An ideal platform would be a standards-based, common framework that could easily transfer data between the layers and the tools, so analyses could be performed with the most appropriate solutions. Besides CERN-specific applications, these analytics would be used for industrial control systems as well as IT and network monitoring.

“In order to get the kind of performance we need to take on new problems in science, not to mention to drive the massive amounts of data that we’re all generating and using in our everyday lives, we’ll need to have new kinds of technology that are much more efficient and that can be eventually manufactured at an affordable cost,” Dan Olds, an analyst with The Gabriel Consulting Group, told Computerworld.

Reader Comments

Tue, Jul 22, 2014 Richard

I think the biggest challenge will be to monitor all the infrastructure and keep it secure, and I wonder how popular network monitoring tools such as Anturis, Pingdom and others will cope with the issues.
