Speakers at PP4EE 2013
List of Speakers
Hiroshi Okuda
Open-Source Parallel FE Software: FrontISTR, Performance Considerations about B/F (Bytes per Flop) of SpMV on K-Supercomputer and GPU-Clusters
Abstract
FrontISTR is an open-source structural analysis system that supports a rich set of nonlinear analysis functions. FrontISTR is also innovative in addressing large-scale applications, parallelism, and programmability: a 7.5 billion DOF problem can be solved in 13.7 h using 65,536 cores of the K computer. Single-core performance is a crucial factor in FEM codes that use iterative equation solvers, and SpMV (Sparse Matrix-Vector Product) is the hotspot there. Cache blocking and a contiguous matrix data structure have been investigated to address the memory wall problem. Running on notebook PCs, PC clusters, and supercomputers including the Earth Simulator 2 and the K computer, FrontISTR has been used to solve various industrial problems, for example: (1) dynamic friction behavior between rails and the wheels of fast-running trains, (2) thermal structural deformation of electrical devices, (3) thermal elastic-plastic residual stress of large-scale welded structures, (4) friction of power transmission belts, (5) large-strain evaluation of tire fill rubber, and (6) fluid-structure coupled behavior of turbine blades.
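For illustration, a baseline CSR SpMV kernel of the following form (a minimal sketch, not FrontISTR's actual implementation; all names are illustrative) makes the B/F problem concrete: each nonzero costs roughly 12 bytes of matrix data for only two floating-point operations, i.e., about 6 B/F before even counting accesses to the vectors.

#include <stddef.h>

/* Plain CSR sparse matrix-vector product y = A*x (illustrative sketch).
 * Each nonzero loads one 8-byte value and one 4-byte column index and
 * performs 2 flops (multiply + add): about 6 bytes of matrix traffic per
 * flop before any vector accesses, far above what the memory system can
 * sustain, which is why the kernel is memory-bandwidth bound and why
 * cache blocking and contiguous matrix layouts are worth investigating. */
void spmv_csr(size_t n,
              const int    *row_ptr,   /* length n+1 */
              const int    *col_idx,   /* length nnz */
              const double *val,       /* length nnz */
              const double *x,
              double       *y)
{
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}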
Georgios Goumas
Alleviating memory-bandwidth limitations for scalability and energy efficiency: Lessons learned from the optimization of SpMV
Abstract
In this talk we will present our approach towards optimizing Sparse Matrix-Vector Multiplication (SpMV) on modern multicore platforms. SpMV is one of the most memory-bandwidth-hungry computational kernels, heavily used in a large variety of HPC applications. To cope with this problem, we propose a new online storage format for sparse matrices called Compressed Sparse eXtended (CSX). CSX applies aggressive compression to the indexing structure of sparse matrices and is able to store them with a significantly reduced memory footprint. When it comes to parallel execution, the scheme achieves remarkable performance improvements and stability for a variety of matrices, in both SMP and NUMA configurations. Based on our findings on CSX, we will also discuss directions for future research on the optimization of resource-demanding applications on modern execution platforms.
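To give a flavour of what compressing the indexing structure means, the sketch below delta-encodes the column indices of one CSR row into bytes. This is only an illustration of the general idea, not the actual CSX encoding (which additionally detects horizontal, vertical, diagonal and dense substructures and run-length encodes them); all names are hypothetical.

#include <stdint.h>
#include <stddef.h>

/* Illustrative only: delta-encode the (sorted) column indices of one CSR
 * row into single bytes where possible.  Fewer index bytes per nonzero
 * means less memory traffic for the bandwidth-bound SpMV kernel. */
size_t delta_encode_row(const int *col_idx, size_t nnz, uint8_t *out)
{
    size_t bytes = 0;
    int prev = 0;
    for (size_t k = 0; k < nnz; ++k) {
        int delta = col_idx[k] - prev;
        prev = col_idx[k];
        while (delta > 127) {               /* escape byte for large gaps */
            out[bytes++] = 0xFF;            /* decoder adds 127 and continues */
            delta -= 127;
        }
        out[bytes++] = (uint8_t)delta;      /* final delta, always <= 127 */
    }
    return bytes;                           /* encoded size in bytes */
}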
Magnus Jahre
The NTNU/IME focus area research project: Energy Efficient Computing Systems (EECS)
Abstract
Future computing systems are expected to be a collection of processing elements with different energy/performance characteristics due to the Dark Silicon effect. In such systems, only the subset of processing elements that maximizes energy efficiency for the current application is enabled. At least two research breakthroughs are necessary for this vision to become a reality. First, we need to develop efficient software for heterogeneous systems. Second, we need to identify and implement the most efficient processing cores and accelerators, as well as integrate them efficiently into the complete system. These breakthroughs can be achieved through experiments on commercially available hardware or through simulation. Unfortunately, the level of heterogeneity of commercial hardware is limited, and the performance overhead of simulation is significant.
To meet these challenges, we propose the Single-ISA Heterogeneous MAny-core Computer (SHMAC). SHMAC is an infrastructure for realizing heterogeneous computing systems from a collection of diverse, generic processing elements based on a common high-level architecture. Using reconfigurable FPGAs, it is possible to rapidly evaluate software and hardware innovations in a collection of systems that are significantly more heterogeneous than what is commercially available. In this presentation, we will focus on the motivation and implementation of SHMAC. We will also cover our future plans and how SHMAC can be leveraged in research collaborations.
Ana Varbanescu
Performance Portability in the Multi-core Era: Myths and Facts
Abstract
The “write-once-run-everywhere” programming models are still seen as a marketing trick in computer science. OpenCL, the newest such model, is no exception: proposed in 2009 as an instrument for portable parallel programming across multiple multi-/many-core architectures, it was quickly criticized for its lack of “performance portability”.
This talk is intended as a thorough discussion on performance portability. Therefore, it addresses three essential questions: (1) what is performance portability? (2) can we quantify performance portability? (3) can a programming model achieve performance portability?
We provide our vision on answering these three questions, using the OpenCL programming model and multi-/many-core architectures as running examples.
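As one purely illustrative way question (2) could be made precise (an assumption for the sake of exposition, not necessarily the definition advocated in the talk), the performance achieved on each platform can be normalized by the best-known performance on that platform and then aggregated over the platform set:

% Illustrative sketch only: one possible formalization of performance portability.
% e(a,p): efficiency of application a on platform p, relative to the best-known result.
\[
  e(a,p) = \frac{\mathrm{perf}_{\mathrm{achieved}}(a,p)}{\mathrm{perf}_{\mathrm{best\,known}}(a,p)},
  \qquad
  \Phi(a,P) =
  \begin{cases}
    \dfrac{|P|}{\sum_{p \in P} 1/e(a,p)} & \text{if } a \text{ runs correctly on every } p \in P,\\[1.5ex]
    0 & \text{otherwise.}
  \end{cases}
\]

Under such a definition, a single OpenCL code with an aggregate efficiency close to 1 over a diverse platform set would be considered performance portable.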
Javed Absar
The EU project CARP: Correct and Efficient Accelerator Programming
Abstract
Programming accelerators such as GPUs is accomplished today using low-level APIs such as OpenCL, which raises concerns from the programmer productivity and performance portability perspectives. Programmer productivity is affected because low-level APIs distract the programmer from the actual problem. Performance portability is affected because code optimized for a particular accelerator is unlikely to perform as well on another.
This talk will present a compilation flow that we have developed at ARM, together with other European partners, which aims to address both concerns. The compilation flow includes VOBLA, a DSL that can compactly represent linear algebra operations, separating functional semantics from implementation details such as storage layouts. Any parallelism inherent in the function is not obscured by implementation details, easing parallel code generation.
VOBLA is compiled into PENCIL, a C99-based, platform-neutral compute intermediate language, while retaining sufficient information for generating efficient accelerator code. PENCIL is then compiled into OpenCL code optimized for a specific accelerator, using polyhedral-model-based techniques that make use of the retained information.
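As a rough illustration of the style of code involved (plain C99 only; this is not actual VOBLA or PENCIL syntax), a linear-algebra operation such as a matrix-vector product lowers to affine loop nests over non-aliasing arrays with explicit sizes, which is exactly the information a polyhedral compiler needs in order to tile, parallelize, and map the loops onto OpenCL work-groups:

/* Illustrative only -- not actual VOBLA or PENCIL syntax.  A DSL statement
 * such as "y = alpha*A*x + beta*y" might be lowered to platform-neutral C99
 * of roughly this shape: affine loops, explicit array extents, and restrict-
 * qualified parameters so the compiler knows the arrays do not alias. */
void gemv(int m, int n, double alpha, double beta,
          const double A[restrict m][n],
          const double x[restrict n],
          double y[restrict m])
{
    for (int i = 0; i < m; ++i) {
        double acc = 0.0;
        for (int j = 0; j < n; ++j)
            acc += A[i][j] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}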
This is exciting research with great potential benefits for GPU programming, in terms of both programmer productivity and performance portability.
Juan Manuel Cebrian
Are we Optimizing Hardware for non-optimized Applications? PARSEC's Vectorization Effects on Energy Efficiency and Architectural Requirements
Abstract
Validation of new architectural proposals against real applications is a necessary step in academic research. However, providing benchmarks that keep up with new architectural changes has become a real challenge. If benchmarks do not cover the most common architectural features, architects may end up under- or overestimating the impact of their contributions.
In this work, we extend the PARSEC benchmark suite with SIMD capabilities to provide an enhanced evaluation framework for new academic/industry proposals. We then perform a detailed energy and performance evaluation of this commonly used application set on different platforms (Intel® and ARM®). Results show how SIMD code alters scalability, energy efficiency, and hardware requirements. Performance and energy-efficiency improvements (up to 10x) depend greatly on the fraction of code that we can actually vectorize. We base our code on a custom-built wrapper library compatible with SSE, AVX, and NEON to facilitate rapid and general vectorization. We aim to distribute the source code to reinforce the evaluation process of new proposals for future computing systems.
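A minimal sketch of how such a wrapper library might look is given below; the type and function names are hypothetical, not the interface of the actual library, but they illustrate how one generic vector type can be mapped onto SSE, AVX or NEON intrinsics at compile time so that kernel code is written and vectorized only once.

/* Hypothetical wrapper: one generic SIMD type dispatched to the available
 * instruction set at compile time (AVX, SSE, NEON, or a scalar fallback). */
#if defined(__AVX__)
  #include <immintrin.h>
  typedef __m256 simd_f;                        /* 8 floats */
  #define SIMD_WIDTH 8
  static inline simd_f simd_add(simd_f a, simd_f b)  { return _mm256_add_ps(a, b); }
  static inline simd_f simd_load(const float *p)     { return _mm256_loadu_ps(p); }
  static inline void   simd_store(float *p, simd_f v){ _mm256_storeu_ps(p, v); }
#elif defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 simd_f;                        /* 4 floats */
  #define SIMD_WIDTH 4
  static inline simd_f simd_add(simd_f a, simd_f b)  { return _mm_add_ps(a, b); }
  static inline simd_f simd_load(const float *p)     { return _mm_loadu_ps(p); }
  static inline void   simd_store(float *p, simd_f v){ _mm_storeu_ps(p, v); }
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t simd_f;                   /* 4 floats */
  #define SIMD_WIDTH 4
  static inline simd_f simd_add(simd_f a, simd_f b)  { return vaddq_f32(a, b); }
  static inline simd_f simd_load(const float *p)     { return vld1q_f32(p); }
  static inline void   simd_store(float *p, simd_f v){ vst1q_f32(p, v); }
#else                                           /* scalar fallback */
  typedef float simd_f;
  #define SIMD_WIDTH 1
  static inline simd_f simd_add(simd_f a, simd_f b)  { return a + b; }
  static inline simd_f simd_load(const float *p)     { return *p; }
  static inline void   simd_store(float *p, simd_f v){ *p = v; }
#endif

/* Element-wise vector addition written once against the wrapper. */
void vec_add(float *c, const float *a, const float *b, int n)
{
    int i = 0;
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        simd_store(c + i, simd_add(simd_load(a + i), simd_load(b + i)));
    for (; i < n; ++i)                          /* scalar remainder */
        c[i] = a[i] + b[i];
}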
Guillermo Miranda
OmpSs with OpenCL and OmpSs/MPI
Abstract
This talk will introduce the OmpSs programming model and its interoperability with MPI and OpenCL. OmpSs enables mixing MPI code with OpenMP-like directives, improving IPC and allowing communication to be overlapped with computation. OmpSs also has support for CUDA and OpenCL: programmers can call GPU kernels without worrying about initialisation (troublesome in OpenCL), memory space management, data copying, or device selection. OmpSs is able to schedule work to the available GPUs and provides ways to run the application across all the available computing resources (CPUs or accelerators).
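A minimal sketch of the directive style involved is shown below, assuming the OmpSs in/out/inout dependence clauses and the Mercurium/Nanos++ toolchain (the exact clause spellings should be checked against the current OmpSs documentation). The runtime derives a task graph from the declared dependences and schedules the tasks accordingly, which is the same mechanism used to overlap MPI communication with computation and to offload kernels to CUDA/OpenCL devices.

/* Illustrative OmpSs-style tasks (hypothetical example).  The in/inout
 * clauses declare data dependences over array regions; the runtime builds
 * a task graph from them and schedules the tasks in dependence order. */
static void scale(int n, double *x, double a)
{
    for (int i = 0; i < n; ++i) x[i] *= a;
}

static void axpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

void example(int n, double *x, double *y, double alpha)
{
    #pragma omp task inout([n]x)              /* task 1: x *= alpha      */
    scale(n, x, alpha);

    #pragma omp task in([n]x) inout([n]y)     /* task 2: y += alpha*x,   */
    axpy(n, alpha, x, y);                     /* ordered after task 1    */

    #pragma omp taskwait                      /* wait for both tasks     */
}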
Jan Chr. Meyer
Energy efficiency on the NTNU supercomputer Vilje
Abstract
This talk will describe the process of profiling application energy consumption on the Vilje supercomputer, using model-specific registers present in the Sandy Bridge architecture. These registers enable software to sample the energy consumption of the processor package and dynamic memory with fine granularity, but implementing access presents several challenges which are compounded in a production environment. A survey of our ongoing work to facilitate energy measurement will be presented, along with an overview of results that have been attained throughout the process.
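As a minimal sketch of the measurement principle (the RAPL register addresses are the documented Intel ones; the surrounding code is illustrative and not the actual tooling used on Vilje), the package energy counter can be sampled through the Linux msr driver as follows:

/* Sample the Sandy Bridge package energy counter via /dev/cpu/0/msr
 * (requires the msr kernel module and sufficient privileges). */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606   /* energy unit = 1/2^ESU joules (bits 12:8) */
#define MSR_PKG_ENERGY_STATUS 0x611   /* 32-bit wrapping package energy counter   */

static uint64_t read_msr(int fd, uint32_t reg)
{
    uint64_t v = 0;
    if (pread(fd, &v, sizeof v, reg) != sizeof v)
        perror("pread");
    return v;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t unit = read_msr(fd, MSR_RAPL_POWER_UNIT);
    double joules_per_tick = 1.0 / (double)(1u << ((unit >> 8) & 0x1F));

    uint64_t e0 = read_msr(fd, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFFu;
    sleep(1);                                   /* region of interest */
    uint64_t e1 = read_msr(fd, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFFu;

    /* The counter is 32 bits and wraps; handle a single wrap-around. */
    uint64_t ticks = (e1 >= e0) ? e1 - e0 : (1ULL << 32) - e0 + e1;
    printf("package energy: %.3f J\n", ticks * joules_per_tick);

    close(fd);
    return 0;
}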
Trond Kvamsdal
The NTNU/IME focus area research project: Computational Science and Engineering (CSE)