DF21500 Multicore Computing

Papers for Student Presentations (23/3/2011)

Choose one paper from the following list for presentation and another (preferably related) one for opposition.


  1. D. Cederman, Ph. Tsigas:
    Supporting Lock-Free Composition of Concurrent Data Objects.
    Proceedings of the 7th ACM conference on Computing frontiers, pp. 53-62. 2010.

  2. Maged M. Michael, Michael L. Scott:
    Non-Blocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
    J. Parallel and Distr. Comput. 51(1): 1-26, May 1998.


  4. Maurice Herlihy, J. Eliot B. Moss:
    Transactional memory: Architectural support for lock-free data structures.
    Proc. ISCA'93 20th Int. Symp. on Computer Architecture, pp. 289-300, 1993.

  5. M. Ansari et al.:
    Steal-on-Abort: Improving Transactional Memory Performance through Dynamic Transaction Reordering
    Proc. HiPEAC-2009


  7. Eduard Ayguade et al.:
    The Design of OpenMP Tasks.
    IEEE Trans. on Par. and Distr. Syst. 20(3), March 2009.

  8. T. D. Han and T. Abdelrahman:
    hiCUDA: High-Level GPGPU Programming.
    IEEE Trans. Par. Distr. Syst. 22(1), Jan. 2011.

  9. P. Charles et al.:
    X10: An Object-Oriented Approach to Non-Uniform Cluster Computing
    Proc. OOPSLA-2005

  10. Ganesh Bikshandi et al.:
    Design and Use of htalib - A Library for Hierarchically Tiled Arrays.
    Proc. LCPC-2006, Springer LNCS 4382:17-32, 2008.

  11. Q. Hou et al.:
    BSGP: Bulk-synchronous GPU programming.
    ACM Trans. Graph. 27(3), article 19, 2008.

  12. R. Chen et al.:
    Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling
    Proc. PACT-2010 Conference, ACM.


  14. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier.
    StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.
    In Proc. Euro-Par 2009, LNCS 5704, pp. 863-874, 2009.

  15. Cédric Augonnet, Samuel Thibault, and Raymond Namyst.
    Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures.
    In Proceedings of the International Euro-Par Workshops 2009, HPPC'09, volume 6043 of Lecture Notes in Computer Science, Delft, The Netherlands, pages 56-65, August 2009. Springer.

  16. Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias:
    Provably Efficient Scheduling for Languages with Fine-Grained Parallelism
    J. of the ACM 46(2), March 1999, pp. 281-321.

  17. M. Nijhuis et al.:
    Mapping and synchronizing streaming applications on Cell processors
    Proc. HiPEAC'09 conference, Jan. 2009

  18. H. Park et al.:
    Edge-centric modulo scheduling for coarse-grained reconfigurable architectures.
    Proc. 17th Int. Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.


  20. X. Li, M. Garzaran, D. Padua:
    A Dynamically Tuned Sorting Library
    Proc. CGO-2004

  21. N. Thomas et al.:
    A Framework for Adaptive Algorithm Selection in STAPL.
    Proc. ACM SIGPLAN Symp. Prin. Prac. Par. Prog. (PPOPP), pp. 277-288, Chicago, Illinois, Jun 2005.

  22. Markus Püschel et al.:
    SPIRAL: Code Generation for DSP Transforms
    Proceedings of the IEEE 93(2):232-275, 2005

  23. S. Williams et al.:
    PERI - Auto-tuning memory-intensive kernels for multicore
    SciDAC 2008, Journal of Physics: Conference Series 125(2008) 012038, IOP Publishing

  24. B. Jang et al.:
    Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures.
    IEEE Trans. Par. Distr. Syst. 22(1), Jan. 2011.


  26. D. Cederman, Ph. Tsigas:
    GPU-Quicksort: A practical Quicksort algorithm for graphics processors.
    ACM Journal of Experimental Algorithmics, Vol. 14, Article No. 1.4, July 2009.

  27. N. Leischner, V. Osipov, P. Sanders:
    GPU Sample Sort.
    Proc. IPDPS-2010, April 2010.

  28. Scarpazza, Villa, Petrini:
    Efficient Breadth-First Search on the Cell/BE Processor
    IEEE Trans. Par. and Distr. Syst. 19(10):1381-1395, 2008.

  29. Gary J. Katz and Joseph T. Kider, Jr:
    All-pairs shortest-paths for large graphs on the GPU
    23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, 2008.

  30. A. Azevedo et al.:
    Parallel H.264 Decoding on an Embedded Multicore Processor.
    Proc. HiPEAC'09 conference, Jan. 2009


  32. S. Williams, A. Waterman, D. Patterson:
    Roofline: an insightful visual performance model for multicore architectures.
    Comm. ACM 52(4), April 2009.
    Presenter: (prel.) Thierry Vilmart, Opponent: TBA

  33. V. Adve, M. Vernon:
    Parallel program performance prediction using deterministic task graph analysis.
    ACM Trans. on Computer Systems 22(1), Feb. 2004.

  34. S. Baghsorkhi et al.:
    An adaptive performance modeling tool for GPU architectures.
    Proc. ACM PPoPP-2010.

Other Optimizations and Analyses

  36. I. Sung et al.:
    Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
    Proc. PACT-2010, ACM.

  37. N. Vasudevan et al.:
    Simple and fast biased locks
    Proc. PACT-2010, ACM.

  38. G. Chen and P. Stenström:
    A Methodology for Diagnosing Critical Section Bottlenecks in Multithreaded Applications.
    Proc. MULTIPROG-2011 workshop, Heraklion, Jan. 2011, pp. 21-35.

Task: Prepare a 20-minute presentation of your chosen paper and at least 3 questions on the other paper for opposition.
After the presentation, hand in a written summary of your presented paper (2-3 pages).

Please send me your presentation slides (ppt or pdf) for approval at least 48 hours before your presentation. If you do not get a reply, you can proceed and present.

This page is maintained by Christoph Kessler (chrke \at ida.liu.se)