List of Papers

  1. Nikos Hardavellas
    The Rise and Fall of Dark Silicon
    USENIX, 2012.


  2. Arkaprava Basu, et al. (AMD Research)
    Software Assisted Hardware Cache Coherence for Heterogeneous Processors
    MEMSYS, 2016.


  3. Whitepaper sponsored by AMD
    HSA: A New Architecture for Heterogeneous Computing
    TIRIAS research, 2013.


  4. Joshua Ho & Ryan Smith
    NVIDIA Tegra X1 Preview & Architecture Analysis
    anandtech.com, 2015.


  5. Ryan Smith
    ARM's Mali Midgard Architecture Explored
    anandtech.com, 2015.


  6. Loi, I; Benini, L
    A Multi Banked - Multi Ported - non Blocking Shared L2 Cache for MPSoC Platforms
    Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014.


  7. Shriraman, A. ; Hongzhou Zhao ; Dwarkadas, S.
    An Application-Tailored Approach to Hardware Cache Coherence
    Computer, 2013.


  8. Branover, A.; Foley, D.; Steinman, M.
    AMD FUSION APU: LLANO
    IEEE Micro, 2012.


  9. Benini, L. ; Flamand, E. ; Fuin, D. ; Melpignano, D.
    P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012


  10. Pricopi, M; Mitra, T
    Bahurupi: A polymorphic heterogeneous multi-core architecture
    ACM Transactions on Architecture and Code Optimization (TACO), 2012.


  11. Minji Kim ; Jinyong Lee ; Younglok Kim
    Fast and flexible pipelined multi-processor architecture for multimedia device
    7th International Symposium on Communication Systems Networks and Digital Signal Processing (CSNDSP), 2010


  12. Shekofteh, S.K. ; Deldari, H. ; Khalkhali, M.B.
    Reducing cache contention in a multi-core processor via a scheduler
    3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), 2010


  13. Kalla, R. ; Sinharoy, B. ; Starke, W.J. ; Floyd, M.
    Power7: IBM's Next-Generation Server Processor
    IEEE Micro, 2010.


  14. Gueron, S.
    Intel's New AES Instructions for Enhanced Performance and Security
    16th International Workshop, Fast Software Encryption (FSE) 2009


  15. Tuan, V.M. ; Katsura, N. ; Matsutani, H. ; Amano, H.
    Evaluation of a multicore reconfigurable architecture with variable core sizes
    IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2009.


  16. Al Maashri, A.; Guangyu Sun ; Xiangyu Dong ; Narayanan, V. ; Yuan Xie
    3D GPU architecture using cache stacking: Performance, cost, power and thermal analysis
    IEEE International Conference on Computer Design (ICCD), 2009.


  17. Cohen, J. ; Garland, M.
    Novel Architectures: Solving Computational Problems with GPU Computing
    Computing in Science & Engineering, 2009


  18. Chengming Zou; Chunfen Xia; Guanghui Zhao
    Numerical Parallel Processing Based on GPU with CUDA Architecture
    International Conference on Wireless Networks and Information Systems (WNIS), 2009


  19. Zamith, M.P.M. ; Clua, E.W.G. ; Conci, A. ; Montenegro, A.
    Parallel processing between GPU and CPU: Concepts in a game architecture
    Computer Graphics, Imaging and Visualisation (CGIV), 2007


  20. del Barrio, V.M.; Gonzalez, C. ; Roca, J. ; Fernandez, A. ; Espasa, R.
    ATTILA: a cycle-level execution-driven simulator for modern GPU architectures
    IEEE International Symposium on Performance Analysis of Systems and Software, 2006


  21. Teodorescu, R.; Torrellas, J.
    Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors.
    35th Intl. Symp. on Computer Architecture (ISCA), pp. 363-374, 2008.


  22. Loh, G.H.
    3D-Stacked Memory Architectures for Multi-core Processors.
    35th Intl. Symp. on Computer Architecture (ISCA), pp. 453-464, 2008.


  23. Hankins, R.A.; Chinya, G.N.; Collins, J.D.; Wang, P.H.; Rakvic, R.; Hong Wang; Shen, J.P.
    Multiple Instruction Stream Processor
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 114-127, 2006.


  24. Jichuan Chang; Sohi, G.S.
    Cooperative Caching for Chip Multiprocessors
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 264-276, 2006.


  25. Dybdahl, H.; Stenstrom, P.
    An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors
    13th Intl. Symp. on High Performance Computer Architecture (HPCA), pp. 2-12, 2007.


  26. Alameldeen, A.R.; Wood, D.A.
    Interactions Between Compression and Prefetching in Chip Multiprocessors
    13th Intl. Symp. on High Performance Computer Architecture (HPCA), pp. 228-239, 2007.


  27. Strauss, K., Shen, X., and Torrellas, J. 2006.
    Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors.
    33rd Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 327-338.


  28. Jaehyuk Huh, Changkyu Kim, Shafi, H., Lixin Zhang, Burger, D., and Keckler, S.W. 2007.
    A NUCA Substrate for Flexible CMP Cache Sharing.
    IEEE Trans. Parallel and Distributed Systems 18(8), pp. 1028-1040.


  29. Izadi, B.A., and Ozguner, F. 2003.
    Enhanced Cluster k-Ary n-Cube, A Fault-Tolerant Multiprocessor.
    IEEE Trans. Computers 52 (11), pp. 1443-1453.


  30. Hoseok Chang, Junho Cho, and Wonyong Sung. 2006.
    Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit.
    IEEE Work. on Signal Processing Systems Design and Implementation (SIPS), pp. 71-76.


  31. Hoare, R., Tung, S., and Werger, K. 2004.
    An 88-way Multiprocessor within an FPGA with Customizable Instructions.
    18th Intl. Parallel and Distributed Processing Symp., pp. 258-266.


  32. Junho Cho, Hoseok Chang, and Wonyong Sung. 2006.
    An FPGA Based SIMD Processor with A Vector Memory Unit.
    IEEE Intl. Symp. on Circuits and Systems (ISCAS), 4 pp.


  33. Taylor, M.D., Lee, W., Amarasinghe, S.P., and Agarwal, A. 2005.
    Scalar Operand Networks.
    IEEE Trans. Parallel and Distributed Systems 16(2), pp. 145-162.


  34. Speight, E., Shafi, H., Lixin Zhang, and Rajamony, R. 2005.
    Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors.
    32nd Intl. Symp. on Computer Architecture (ISCA), pp. 346-356.


  35. Dunigan, T.H., Jr., Vetter, J.S., and Worley, P.H. 2004.
    Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture.
    12th Ann. IEEE Symp. on High Performance Interconnects, pp. 20-25.


  36. Hwa-Joon Oh, Mueller, S.M., Jacobi, C., Tran, K.D., Cottier, S.R., Michael, B.W., Nishikawa, H., Totsuka, Y., Namatame, T., Yano, N.; Machida, T., and Dhong, S.H. 2006.
    A Fully Pipelined Single-Precision Floating-Point Unit in the Synergistic Processor Element of a CELL Processor.
    IEEE J. of Solid-State Circuits 41(4), pp. 759-771.


  37. Lu Peng, Jih-Kwon Peir, Prakash, T.K., Yen-Kuang Chen, and Koppelman, D. 2007.
    Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study.
    IEEE Intl. Performance, Computing, and Communications Conf., pp. 55-64.


  38. Ye, T.T., and De Micheli, G. 2003.
    Physical Planning for On-Chip Multiprocessor Networks and Switch Fabrics.
    IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 97-107.


  39. Huang, K., Grunert, D., and Thiele, L. 2007.
    Windowed FIFOs for FPGA-based Multiprocessor Systems.
    IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 36-41.


  40. Herbordt, M.C., Cravy, J., and Lin, C. 2003.
    Memory Considerations for High Performance SIMD Systems with On-Chip Control.
    IEEE Intl. Workshop on Computer Architectures for Machine Perception, 12 pp.


  41. Ipek, E., Mutlu, O., Martinez, J.F., and Caruana, R. 2008.
    Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.
    35th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 39-50.


  42. Kang, J.-Y., Gupta, S., and Gaudiot, J.-L. 2008.
    An Efficient Data-Distribution Mechanism in a Processor-In-Memory (PIM) Architecture Applied to Motion Estimation.
    IEEE Trans. Computers 57(3), pp. 375-388.


  43. Murali, S., Atienza, D., Meloni, P., Carta, S., Benini, L., De Micheli, G., and Raffo, L. 2007.
    Synthesis of Predictable Networks-on-Chip-Based Interconnect Architectures for Chip Multiprocessors.
    IEEE Trans. Very Large Scale Integration (VLSI) Systems 15(8), pp. 869-880.


  44. Pasricha, S., Dutt, N.D., and Ben-Romdhane, M. 2007.
    BMSYN: Bus Matrix Communication Architecture Synthesis for MPSoC.
    IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 26(8), pp. 1454-1464.


  45. Sun, F., Jha, N.K., Ravi, S., and Raghunathan, A. 2005.
    Synthesis of Application-specific Heterogeneous Multiprocessor Architectures using Extensible Processors.
    18th Intl. Conf. VLSI Design, pp. 551-556.


  46. Bocchi, M., de Dominicis, M., Mucci, C., Deledda, A., Campi, F., Lodi, A., Toma, M., and Guerrieri, R. 2006.
    Design and Implementation of a Reconfigurable Heterogeneous Multiprocessor SoC.
    IEEE Conf. Custom Integrated Circuits, pp. 93-96.


  47. Wun, B., and Crowley, P. 2006.
    Network I/O Acceleration in Heterogeneous Multicore Processors.
    14th IEEE Symp. on High-Performance Interconnects, pp. 9-14.


  48. Bobda, C., and Ahmadinia, A. 2005.
    Dynamic Interconnection of Reconfigurable Modules on Reconfigurable Devices.
    IEEE Design & Test of Computers 22(5), pp. 443-451.


  49. Nava, M.D., Blouet, P., Teninge, P., Coppola, M., Ben-Ismail, T., Picchiottino, S., and Wilson, R. 2005.
    An Open Platform for Developing Multiprocessor SoCs.
    IEEE Computer 38(7), pp. 60-67.


  50. Villa, F.J., Acacio, M.E., and Garcia, J.M. 2006.
    On the Evaluation of Dense Chip-Multiprocessor Architectures.
    Intl. Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS), pp. 21-27.


  51. Taeweon Suh, Daehyun Kim, and Lee, H.-H.S. 2005.
    Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs.
    42nd Design Automation Conference (DAC), pp. 553-558.


  52. Bonorden, O., Bruls, N., Kastens, U., Dinh Khoi Le, auf der Heide, F.M., Niemann, J.-C., Porrmann, M., Ruckert, U., Slowik, A., and Thies, M. 2003.
    A Holistic Methodology for Network Processor Design.
    28th Ann. IEEE Intl. Conf. Local Computer Networks, pp. 583-592.


  53. Manhee Lee, Minseon Ahn, and Eun Jung Kim. 2007.
    I2SEMS: Interconnects-Independent Security Enhanced Shared Memory Multiprocessor Systems.
    16th Intl. Conf. Parallel Architecture and Compilation Techniques (PACT), pp. 94-103.


  54. Monchiero, M., Palermo, G., Silvano, C., and Villa, O. 2006.
    Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors.
    Intl. Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS), pp. 144-151.


  55. Jianjun Guo, Mingche Lai, Zhengyuan Pang, Libo Huang, Fangyuan Chen, Kui Dai, and Zhiying Wang. 2008.
    Memory System Design for a Multi-core Processor.
    Intl. Conf. Complex, Intelligent and Software Intensive Systems (CISIS), pp. 601-606.


  56. Lv, T., Ozer, I.B., Chakradhar, S.T., Jiang Xu, Wolf, W., and Henkel, J. 2005.
    A Methodology for Architectural Design of Multimedia Multiprocessor SoCs.
    IEEE Design & Test of Computers 22(1), pp. 18-26.


  57. Ankur Agarwal, Mustafa, M., and Pandya, A.S. 2006.
    QOS Driven Network-on-Chip Design for Real Time Systems.
    Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1291-1295.


  58. Feihui Li, Nicopoulos, C., Richardson, T., Yuan Xie, Narayanan, V., and Kandemir, M. 2006.
    Design and Management of 3D Chip Multiprocessors Using Network-in-Memory.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 130-141.


  59. Bhuyan, L.N., and Hujun Wang. 2003.
    Switch MSHR: A Technique to Reduce Remote Read Memory Access Time in CC-NUMA Multiprocessors.
    IEEE Trans. Computers 52(5), pp. 617-632.


  60. Pitter, C., and Schoeberl, M. 2008.
    Performance Evaluation of a Java Chip-Multiprocessor.
    Intl. Symp. on Industrial Embedded Systems (SIES), pp. 34-42.


  61. Hager, G., Zeiser, T., and Wellein, G. 2008.
    Data Access Optimizations for Highly Threaded Multi-Core CPUs with Multiple Memory Controllers.
    Intl. Parallel and Distributed Processing Symp., pp. 1-7.


  62. Yingmin Li, Lee, B., Brooks, D., Zhigang Hu, and Skadron, K. 2006.
    Impact of Thermal Constraints on Multi-Core Architectures.
    In Proc. 10th Intersociety Conf. on Thermal and Thermomechanical Phenomena in Electronics Systems, 8 pp.


  63. Daewook Kim, Manho Kim, and Sobelman, G.E. 2006.
    DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs.
    IEEE Intl. Symp. on Circuits and Systems (ISCAS), 4 pp.


  64. Beltran, M., and Guzman, A. 2008.
    Designing HIPAOC: High Performance Architecture On Chip.
    Intl. Symp. on Industrial Embedded Systems (SIES), pp. 233-236.


  65. Godiwala, N., Leonard, J., and Reilly, M. 2008.
    A Network Fabric for Scalable Multiprocessor Systems.
    16th IEEE Symp. on High Performance Interconnects, pp. 137-144.


  66. Berg, E., Zeffer, H., and Hagersten, E. 2006.
    A Statistical Multiprocessor Cache Model.
    IEEE Intl. Symp. on Performance Analysis of Systems and Software, pp. 89-99.


  67. Pham, D.C., et al. 2006.
    Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor.
    IEEE J. of Solid-State Circuits 41(1), pp. 179-196.


  68. Liqun Cheng, Muralimanohar, N., Ramani, K., Balasubramonian, R., and Carter, J.B. 2006.
    Interconnect-Aware Coherence Protocols for Chip Multiprocessors.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 339-351.


  69. Ramazani, A., Monteiro, E., Dandache, A., and Lepley, B. 2003.
    A Methodology to Design a Multimedia Processor Core.
    10th IEEE Intl. Conf. Electronics, Circuits and Systems (3), pp. 998-1001.


  70. Paver, N.C., Khan, M.H., Aldrich, B.C., and Emmons, C.D. 2003.
    Accelerating Mobile Video Applications using Intel Wireless MMX Technology.
    IEEE Workshop on Signal Processing Systems, pp. 207-212.


  71. Martin, M.M.K., Harper, P.J., Sorin, D.J., Hill, M.D., and Wood, D.A. 2003.
    Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors.
    30th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 206-217.


  72. Frachtenberg, E., Petrini, F., Fernandez, J., and Pakin, S. 2006.
    STORM: Scalable Resource Management for Large-Scale Parallel Computers.
    IEEE Trans. Computers 55(12), pp. 1572-1587.


  73. Feehrer, J., Rotker, P., Shih, M., Gingras, P., Yakutis, P., Phillips, S., Heath, J., and Turullols, S. 2008.
    Coherency Hub Design for Multi-Node Victoria Falls Server Systems.
    16th IEEE Symp. on High Performance Interconnects, pp. 43-50.


  74. Wangyuan Zhang and Tao Li. 2008.
    Managing Multi-Core Soft-Error Reliability Through Utility-driven Cross Domain Optimization.
    In Proc. IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 132-137.


  75. Suhendra, V., and Mitra, T. 2008.
    Exploring Locking & Partitioning for Predictable Shared Caches on Multi-Cores.
    45th Design Automation Conference (DAC), pp. 300-303.


  76. Karlsson, M., and Hagersten, E. 2007.
    Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution.
    IEEE Intl. Parallel and Distributed Processing Symp., pp. 1-10.


  77. Ozturk, O., Kandemir, M., Chen, G., Irwin, M.J., and Karakoy, M. 2005.
    Customized On-Chip Memories for Embedded Chip Multiprocessors.
    In Proc. of the Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 743-748.


  78. Xu, M., Thulasiraman, P., and Thulasiram, R.K. 2008.
    Exploiting Data Locality in FFT using Indirect Swap Network on Cell/B.E.
    In Proc. 22nd Intl. Symp. on High Performance Computing Systems and Applications (HPCS), pp. 88-94.


  79. Dreslinski, R.G., Bo Zhai, Mudge, T., Blaauw, D., and Sylvester, D. 2007.
    An Energy Efficient Parallel Architecture Using Near Threshold Operation.
    16th Intl. Conf. Parallel Architecture and Compilation Techniques (PACT), pp. 175-188.


  80. Dash, A., and Petrov, P. 2006.
    Energy-Efficient Cache Coherence for Embedded Multi-Processor Systems through Application-Driven Snoop Filtering.
    9th EUROMICRO Conference on Digital System Design (DSD), pp. 79-82.


  81. Francesco, P., Antonio, P., and Marchal, P. 2005.
    Flexible Hardware/Software Support for Message Passing on a Distributed Shared Memory Architecture.
    Design, Automation and Test in Europe (DATE), pp. 736-741.


  82. Shadich, R.; McLoughlin, I.V. 2005.
    A Scalable Parallel Computational Core for Embedded Processing.
    IEEE TENCON Region 10, pp. 1-6.


  83. Pasricha, S., Young-Hwan Park, Kurdahi, F.J., and Dutt, N. 2006.
    System-Level Power-Performance Trade-Offs in Bus Matrix Communication Architecture Synthesis.
    CODES+ISSS, pp. 300-305.


  84. Tseng, J.H., Hao Yu, Nagar, S., Dubey, N., Franke, H., and Pattnaik, P. 2007.
    Performance Studies of Commercial Workloads on a Multi-core System.
    10th IEEE Intl. Symp. on Workload Characterization, pp. 57-65.



Page responsible: Zebo Peng
Last updated: 2016-12-12