High-performance work-efficient implementations of parallel scan are usually hard-coded, leading to performance portability issues. Data-parallel functional languages excel at expressing programs as compositions of simple patterns. Performance portability is usually achieved using a generative approach, which decomposes the primitive into simpler, composable parts, expressing the implementation space. Lift, Futhark, and Accelerate have successfully applied this technique to patterns such as parallel reduction and tiling. However, work-efficient parallel scan is still provided as a hard-coded builtin. This paper shows how to decompose a classical GPU work-efficient parallel scan in terms of other data-parallel functional primitives. This enables the automatic exploration of the implementation design space, using a set of simple rewrite rules. As the evaluation shows, this technique outperforms hand-written baselines and Futhark, a state-of-the-art high-performance code generator.
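To make the idea of decomposing scan into simpler primitives more concrete, here is a minimal Python sketch of a work-efficient exclusive scan written only with split, map, zip, and join steps. It is an illustration under stated assumptions, not the decomposition or the rewrite rules used in the paper: the names `split2` and `work_efficient_scan` are hypothetical, the input length is assumed to be a power of two, and the code runs sequentially on the CPU rather than generating GPU code.

```python
from itertools import chain
from operator import add


def split2(xs):
    """Split a list of even length into adjacent pairs (hypothetical helper)."""
    return [(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]


def work_efficient_scan(op, identity, xs):
    """Exclusive scan of xs under an associative op.

    Illustrative sketch only: assumes len(xs) is a power of two.
    """
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [identity]

    pairs = split2(xs)
    # Up-sweep step: a map that reduces every pair to one partial result.
    sums = list(map(lambda p: op(p[0], p[1]), pairs))
    # Recurse on the half-sized list of partial results.
    prefixes = work_efficient_scan(op, identity, sums)
    # Down-sweep step: a map over (prefix, pair) that emits two outputs
    # per pair, followed by a join that flattens them.
    expanded = map(lambda sp: (sp[0], op(sp[0], sp[1][0])), zip(prefixes, pairs))
    return list(chain.from_iterable(expanded))


print(work_efficient_scan(add, 0, [1, 2, 3, 4, 5, 6, 7, 8]))
# prints [0, 1, 3, 6, 10, 15, 21, 28]
```

Each recursion level applies the operator about once per input element and halves the problem size, so the total number of operator applications stays linear in the input length, which is what "work-efficient" refers to. Because every step is a plain map over independent pairs, each level could in principle run in parallel.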