ARB/Topics/Task Stealing framework
We intend to work on a task-stealing based framework in VTK.
This framework works at the module level. Thus, it is orthogonal to existing dataflow parallelization efforts.
We are looking for use cases.
Measuring benfits :
- Memory consumption
- Ease of use
The implementation is currently at the design stage.
Kaapi will be used to demonstrate the framework, but other implementations could be developed (Intel TBB, Cilk, OpenMP, ...).
Task Stealing is a programming model that can be used to efficiently develop multi threaded applications, and to use hybrid resources. This model uses a few concepts that the developers have to become used to, but most of the resource management is hidden to the user.
Task Stealing description
The task stealing framework can be applied to tasks that can be recursively split into smaller tasks.
A resource executes first its local tasks (newly generated tasks are pushed in the local task queue). When a resource becomes idle, it randomly selects a victim, and if this victim has some work left, the thief steals part of it (50% usually). If no work can be stolen from this vitcim, the theif select an other victim.
This algorithm has a provable performance.
The last part of the task stealing framework is to merge all the contributions from all workers into a single output.
Application to scientific visualization
Many visualization filters tend to iterate on a data structure to execute a function on some local data. The result is a new data structure. A way to parallelize such algorithms while minimizing the impact on the code is to rely on a two phase parallel algorithm where the iteration becomes a parallel loop (interations are independant tasks) executed in parallel (work stealing takes care of splitting the iterations according to the available resources). Results produced are stored in an intermediate data structure (local to each resource when possible). A second step applies on these intermediate data structures a (parallel) reduction operator to produce the final data structure (again work stealing takes care of task scheduling during this reduction).
Community contribution is welcomed to define :
- use cases (typical work flows),
- test bench (compare speed and memory consumption between an MPI parallelized and a multi-threaded implementation),
- end-user applications (raw VTK, ParaView, ...).
Task Stealing Interface Design
The proposed API will be central to the programmer experience. The idea is to propose a task-stealing interface adapted to the most usual coding designs used in VTK, and to VTK data structures. This interface will have to be progressively defined, and will evolve (at least at the beginning).
Currently, we propose to split the API into two different levels.
High level interface
The first proposal is to expose an interface that can be used with minimal modification to the existing code. This interface can be used for algorithms that can be splitted and the generated pieces merged by a single strategy.
In order to be parallelized, the algorithm would have to explicit the parallel tasks of the algorithm.
- Contouring : the input is splitted by cell, and during the merge operation, we have to manage several pieces of surface generated by each worker.
- streamlines : the seeds can be distributed and each streamline traced independently.
Low level interface
This second level of interface will be necessary for algorithms that can not be decomposed into the three (split, parallel task, then merge) steps, or if the developer wants to provide different kernels for hydrid architectures.
In this case, the developer will have to define the tasks of his algorithm that can be parallelized within the RequestData method, the split and merge operation for each task, and the different implementations for each kind of resource.
In order to help developers to write multi-threaded algorithms, we have to improve multi-threading support in VTK. In the high level interface, the RequestData method has to be reentrant.
Several designs used in VTK are not adapted, and replacements can be progressively provided :
- caches : many algorithms have internal ivars that are used as cache during the RequestData method. A possibility would be to provide a vtkThreadSpecific<> template that could be used as a smart pointer, but that will keep one instance of the cached object per thread.
- iterators : InitTraversal() and Next() methods are still present on VTK containers. We will certainly have to replace those methods by iterator-based traversal.
- and many other nasty surprises...
The plan is to propose a generic interface for task-stealing in vtk, and two implementations to begin with :
- a sequential implementation
- a kaapi-based implementation
And then, a few algorithms will be implemented to demonstrate how to use this interface.
Community contribution would be welcomed to :
- develop another implementation of the work-stealing interface (TBB, Cilk...).
- modify the existing algorithms to use the task-stealing framework.
- use the task-stealing framework in the Execution model.