List of Papers

  1. AMD FUSION APU: LLANO, added 2014

  2. A Multi Banked - Multi Ported - non Blocking Shared L2 Cache for MPSoC Platforms, added 2014

  3. P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator, added 2012

  4. Bahurupi: A polymorphic heterogeneous multi-core architecture, added 2012

  5. Intel's New AES Instructions for Enhanced Performance and Security, added 2011

  6. Evaluation of a multicore reconfigurable architecture with variable core sizes, added 2010

  7. Fast and flexible pipelined multi-processor architecture for multimedia device, added 2010

  8. Reducing cache contention in a multi-core processor via a scheduler, added 2010

  9. Power7: IBM's Next-Generation Server Processor, added 2010

  10. 3D GPU architecture using cache stacking: Performance, cost, power and thermal analysis, added 2010

  11. Novel Architectures: Solving Computational Problems with GPU Computing, 2009

  12. Numerical Parallel Processing Based on GPU with CUDA Architecture, 2009

  13. Parallel processing between GPU and CPU: Concepts in a game architecture, 2007

  14. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures, 2006

  15. Teodorescu, R.; Torrellas, J.
    Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors.
    35th Intl. Symp. on Computer Architecture (ISCA), pp. 363-374, 2008.


  16. Loh, G.H.
    3D-Stacked Memory Architectures for Multi-core Processors.
    35th Intl. Symp. on Computer Architecture (ISCA), pp. 453-464, 2008.


  17. Donald, J.; Martonosi, M.
    Techniques for Multicore Thermal Management: Classification and New Exploration.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 78-88, 2006.


  18. Hankins, R.A.; Chinya, G.N.; Collins, J.D.; Wang, P.H.; Rakvic, R.; Hong Wang; Shen, J.P.
    Multiple Instruction Stream Processor.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 114-127, 2006.


  19. Jichuan Chang; Sohi, G.S.
    Cooperative Caching for Chip Multiprocessors.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 264-276, 2006.


  20. Van Meter, R.; Munro, W.J.; Nemoto, K.; Itoh, K.M.
    Distributed Arithmetic on a Quantum Multicomputer.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 354-365, 2006.


  21. Dybdahl, H.; Stenstrom, P.
    An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors.
    13th Intl. Symp. on High Performance Computer Architecture (HPCA), pp. 2-12, 2007.


  22. Alameldeen, A.R.; Wood, D.A.
    Interactions Between Compression and Prefetching in Chip Multiprocessors.
    13th Intl. Symp. on High Performance Computer Architecture (HPCA), pp. 228-239, 2007.


  23. Strauss, K., Shen, X., and Torrellas, J. 2006.
    Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors.
    33rd Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 327-338.


  24. Rabah, M., and Kanoun, K. 2003.
    Performability Evaluation of Multipurpose Multiprocessor Systems: The “Separation of Concerns” Approach.
    IEEE Trans. Computers 52(2), pp. 223-236.


  25. Jaehyuk Huh, Changkyu Kim, Shafi, H., Lixin Zhang, Burger, D., and Keckler, S.W. 2007.
    A NUCA Substrate for Flexible CMP Cache Sharing.
    IEEE Trans. Parallel and Distributed Systems 18(8), pp. 1028-1040.


  26. Izadi, B.A., and Ozguner, F. 2003.
    Enhanced Cluster k-Ary n-Cube, A Fault-Tolerant Multiprocessor.
    IEEE Trans. Computers 52(11), pp. 1443-1453.


  27. Hoseok Chang, Junho Cho, and Wonyong Sung. 2006.
    Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit.
    IEEE Work. on Signal Processing Systems Design and Implementation (SIPS), pp. 71-76.


  28. Kaneko, S., et al. 2004.
    A 600-MHz Single-Chip Multiprocessor With 4.8-GB/s Internal Shared Pipelined Bus and 512-kB Internal Memory.
    IEEE J. of Solid-State Circuits 39(1), pp. 184-193.


  29. Hoare, R., Tung, S., and Werger, K. 2004.
    An 88-way Multiprocessor within an FPGA with Customizable Instructions.
    18th Intl. Parallel and Distributed Processing Symp., pp. 258-266.


  30. Junho Cho, Hoseok Chang, and Wonyong Sung. 2006.
    An FPGA Based SIMD Processor with A Vector Memory Unit.
    IEEE Intl. Symp. on Circuits and Systems (ISCAS), 4 pp.


  31. Taylor, M.D., Lee, W., Amarasinghe, S.P., and Agarwal, A. 2005.
    Scalar Operand Networks.
    IEEE Trans. Parallel and Distributed Systems 16(2), pp. 145-162.


  32. Speight, E., Shafi, H., Lixin Zhang, and Rajamony, R. 2005.
    Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors.
    32nd Intl. Symp. on Computer Architecture (ISCA), pp. 346-356.


  33. Dunigan, T.H., Jr., Vetter, J.S., and Worley, P.H. 2004.
    Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture.
    12th Ann. IEEE Symp. on High Performance Interconnects, pp. 20-25.


  34. Cvetanovic, Z. 2003.
    Performance Analysis of the Alpha 21364-based HP GS1280 Multiprocessor.
    30th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 218-228.


  35. Hwa-Joon Oh, Mueller, S.M., Jacobi, C., Tran, K.D., Cottier, S.R., Michael, B.W., Nishikawa, H., Totsuka, Y., Namatame, T., Yano, N., Machida, T., and Dhong, S.H. 2006.
    A Fully Pipelined Single-Precision Floating-Point Unit in the Synergistic Processor Element of a CELL Processor.
    IEEE J. of Solid-State Circuits 41(4), pp. 759-771.


  36. Lu Peng, Jih-Kwon Peir, Prakash, T.K., Yen-Kuang Chen, and Koppelman, D. 2007.
    Memory Performance and Scalability of Intel’s and AMD’s Dual-Core Processors: A Case Study.
    IEEE Intl. Performance, Computing, and Communications Conf., pp. 55-64.


  37. Ye, T.T., and De Micheli, G. 2003.
    Physical Planning for On-Chip Multiprocessor Networks and Switch Fabrics.
    IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 97-107.


  38. Martin, M.M.K., Hill, M.D., and Wood, D.A. 2003.
    Token Coherence: Decoupling Performance and Correctness.
    30th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 182-193.


  39. Kodi, A.K., and Louri, A. 2004.
    A Scalable Architecture for Distributed Shared Memory Multiprocessors using Optical Interconnects.
    18th Intl. Parallel and Distributed Processing Symp., pp. 11-21.


  40. Huang, K., Grunert, D., and Thiele, L. 2007.
    Windowed FIFOs for FPGA-based Multiprocessor Systems.
    IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 36-41.


  41. Herbordt, M.C., Cravy, J., and Lin, C. 2003.
    Memory Considerations for High Performance SIMD Systems with On-Chip Control.
    IEEE Intl. Workshop on Computer Architectures for Machine Perception, 12 pp.


  42. Ipek, E., Mutlu, O., Martinez, J.F., and Caruana, R. 2008.
    Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.
    35th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 39-50.


  43. Kang, J.-Y., Gupta, S., and Gaudiot, J.-L. 2008.
    An Efficient Data-Distribution Mechanism in a Processor-In-Memory (PIM) Architecture Applied to Motion Estimation.
    IEEE Trans. Computers 57(3), pp. 375-388.


  44. Murali, S., Atienza, D., Meloni, P., Carta, S., Benini, L., De Micheli, G., and Raffo, L. 2007.
    Synthesis of Predictable Networks-on-Chip-Based Interconnect Architectures for Chip Multiprocessors.
    IEEE Trans. Very Large Scale Integration (VLSI) Systems 15(8), pp. 869-880.


  45. Pasricha, S., Dutt, N.D., and Ben-Romdhane, M. 2007.
    BMSYN: Bus Matrix Communication Architecture Synthesis for MPSoC.
    IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 26(8), pp. 1454-1464.


  46. Sun, F., Jha, N.K., Ravi, S., and Raghunathan, A. 2005.
    Synthesis of Application-specific Heterogeneous Multiprocessor Architectures using Extensible Processors.
    18th Intl. Conf. VLSI Design, pp. 551-556.


  47. Bocchi, M., de Dominicis, M., Mucci, C., Deledda, A., Campi, F., Lodi, A., Toma, M., and Guerrieri, R. 2006.
    Design and Implementation of a Reconfigurable Heterogeneous Multiprocessor SoC.
    IEEE Conf. Custom Integrated Circuits, pp. 93-96.


  48. Wun, B., and Crowley, P. 2006.
    Network I/O Acceleration in Heterogeneous Multicore Processors.
    14th IEEE Symp. on High-Performance Interconnects, pp. 9-14.


  49. Gohringer, D., Hubner, M., Schatz, V., and Becker, J. 2008.
    Runtime Adaptive Multi-Processor System-on-Chip: RAMPSoC.
    IEEE Intl. Symp. on Parallel and Distributed Processing (IPDPS), pp. 1-7.


  50. Shacham, A., Bergman, K., and Carloni, L.P. 2007.
    On the Design of a Photonic Network-on-Chip.
    First Intl. Symp. on Networks-on-Chip (NOCS), pp. 53-64.


  51. Bobda, C., and Ahmadinia, A. 2005.
    Dynamic Interconnection of Reconfigurable Modules on Reconfigurable Devices.
    IEEE Design & Test of Computers 22(5), pp. 443-451.


  52. Nava, M.D., Blouet, P., Teninge, P., Coppola, M., Ben-Ismail, T., Picchiottino, S., and Wilson, R. 2005.
    An Open Platform for Developing Multiprocessor SoCs.
    IEEE Computer 38(7), pp. 60-67.


  53. Villa, F.J., Acacio, M.E., and Garcia, J.M. 2006.
    On the Evaluation of Dense Chip-Multiprocessor Architectures.
    Intl. Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS), pp. 21-27.


  54. Taeweon Suh, Daehyun Kim, and Lee, H.-H.S. 2005.
    Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs.
    42nd Design Automation Conference (DAC), pp. 553-558.


  55. Bonorden, O., Bruls, N., Kastens, U., Dinh Khoi Le, auf der Heide, F.M., Niemann, J.-C., Porrmann, M., Ruckert, U., Slowik, A., and Thies, M. 2003.
    A Holistic Methodology for Network Processor Design.
    28th Ann. IEEE Intl. Conf. Local Computer Networks, pp. 583-592.


  56. Manhee Lee, Minseon Ahn, and Eun Jung Kim. 2007.
    I2SEMS: Interconnects-Independent Security Enhanced Shared Memory Multiprocessor Systems.
    16th Intl. Conf. Parallel Architecture and Compilation Techniques (PACT), pp. 94-103.


  57. Monchiero, M., Palermo, G., Silvano, C., and Villa, O. 2006.
    Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors.
    Intl. Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS), pp. 144-151.


  58. Luo, Y., Laxmi Narayan Bhuyan, and Chen, X. 2003.
    Shared Memory Multiprocessor Architectures for Software IP Routers.
    IEEE Trans. Parallel and Distributed Systems 14(12), pp. 1240-1249.


  59. Jianjun Guo, Mingche Lai, Zhengyuan Pang, Libo Huang, Fangyuan Chen, Kui Dai, and Zhiying Wang. 2008.
    Memory System Design for a Multi-core Processor.
    Intl. Conf. Complex, Intelligent and Software Intensive Systems (CISIS), pp. 601-606.


  60. Lv, T., Ozer, I.B., Chakradhar, S.T., Jiang Xu, Wolf, W., and Henkel, J. 2005.
    A Methodology for Architectural Design of Multimedia Multiprocessor SoCs.
    IEEE Design & Test of Computers 22(1), pp. 18-26.


  61. Ankur Agarwal, Mustafa, M., and Pandya, A.S. 2006.
    QOS Driven Network-on-Chip Design for Real Time Systems.
    Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1291-1295.


  62. Feihui Li, Nicopoulos, C., Richardson, T., Yuan Xie, Narayanan, V., and Kandemir, M. 2006.
    Design and Management of 3D Chip Multiprocessors Using Network-in-Memory.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 130-141.


  63. Bhuyan, L.N., and Hujun Wang. 2003.
    Switch MSHR: A Technique to Reduce Remote Read Memory Access Time in CC-NUMA Multiprocessors.
    IEEE Trans. Computers 52(5), pp. 617-632.


  64. Pitter, C., and Schoeberl, M. 2008.
    Performance Evaluation of a Java Chip-Multiprocessor.
    Intl. Symp. on Industrial Embedded Systems (SIES), pp. 34-42.


  65. Hager, G., Zeiser, T., and Wellein, G. 2008.
    Data Access Optimizations for Highly Threaded Multi-Core CPUs with Multiple Memory Controllers.
    Intl. Parallel and Distributed Processing Symp., pp. 1-7.


  66. Yingmin Li, Lee, B., Brooks, D., Zhigang Hu, and Skadron, K. 2006.
    Impact of Thermal Constraints on Multi-Core Architectures.
    In Proc. 10th Intersociety Conf. on Thermal and Thermomechanical Phenomena in Electronics Systems, 8 pp.


  67. Daewook Kim, Manho Kim, and Sobelman, G.E. 2006.
    DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs.
    IEEE Intl. Symp. on Circuits and Systems (ISCAS), 4 pp.


  68. Kumar, R., Zyuban, V., and Tullsen, D.M. 2005.
    Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling.
    32nd Intl. Symp. on Computer Architecture (ISCA), pp. 408-419.


  69. Beltran, M., and Guzman, A. 2008.
    Designing HIPAOC: High Performance Architecture On Chip.
    Intl. Symp. on Industrial Embedded Systems (SIES), pp. 233-236.


  70. Godiwala, N., Leonard, J., and Reilly, M. 2008.
    A Network Fabric for Scalable Multiprocessor Systems.
    16th IEEE Symp. on High Performance Interconnects, pp. 137-144.


  71. Berg, E., Zeffer, H., and Hagersten, E. 2006.
    A Statistical Multiprocessor Cache Model.
    IEEE Intl. Symp. on Performance Analysis of Systems and Software, pp. 89-99.


  72. Pham, D.C., et al. 2006.
    Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor.
    IEEE J. of Solid-State Circuits 41(1), pp. 179-196.


  73. Noda, H., et al. 2007.
    The Circuits and Robust Design Methodology of the Massively Parallel Processor Based on the Matrix Architecture.
    IEEE J. of Solid-State Circuits 42(4), pp. 804-812.


  74. Wang, X., and Ziavras, S.G. 2006.
    Exploiting Mixed-Mode Parallelism for Matrix Operations on the HERA Architecture through Reconfiguration.
    IEE Computers and Digital Techniques 153(4), pp. 249-260.


  75. Chai, Lei, Gao, Qi, and Panda, D.K. 2007.
    Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System.
    7th IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 471-478.


  76. Liqun Cheng, Muralimanohar, N., Ramani, K., Balasubramonian, R., and Carter, J.B. 2006.
    Interconnect-Aware Coherence Protocols for Chip Multiprocessors.
    33rd Intl. Symp. on Computer Architecture (ISCA), pp. 339-351.


  77. Sinha, P., Sinha, A., and Basu, D. 2005.
    A Reconfigurable “SFMD Architecture” For a Class of Signal Processing Applications.
    In Proc. 7th IEEE CAS Symp. on Emerging Technologies: Circuits and Systems for 4G Mobile Wireless Communications, pp. 46-49.


  78. Ramazani, A., Monteiro, E., Dandache, A., and Lepley, B. 2003.
    A Methodology to Design a Multimedia Processor Core.
    10th IEEE Intl. Conf. Electronics, Circuits and Systems (3), pp. 998-1001.


  79. Paver, N.C., Khan, M.H., Aldrich, B.C., and Emmons, C.D. 2003.
    Accelerating Mobile Video Applications using Intel Wireless MMX Technology.
    IEEE Workshop on Signal Processing Systems, pp. 207-212.


  80. Martin, M.M.K., Harper, P.J., Sorin, D.J., Hill, M.D., and Wood, D.A. 2003.
    Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors.
    30th Ann. Intl. Symp. on Computer Architecture (ISCA), pp. 206-217.


  81. Gold, B.T., Kim, J., Smolens, J.C., Chung, E.S., Liaskovitis, V., Nurvitadhi, E., Falsafi, B., Hoe, J.C., and Nowatzyk, A.G. 2005.
    TRUSS: A Reliable, Scalable Server Architecture.
    IEEE Micro 25(6), pp. 51-59.


  82. Frachtenberg, E., Petrini, F., Fernandez, J., and Pakin, S. 2006.
    STORM: Scalable Resource Management for Large-Scale Parallel Computers.
    IEEE Trans. Computers 55(12), pp. 1572-1587.


  83. Feehrer, J., Rotker, P., Shih, M., Gingras, P., Yakutis, P., Phillips, S., Heath, J., and Turullols, S. 2008.
    Coherency Hub Design for Multi-Node Victoria Falls Server Systems.
    16th IEEE Symp. on High Performance Interconnects, pp. 43-50.


  84. Wangyuan Zhang and Tao Li. 2008.
    Managing Multi-Core Soft-Error Reliability Through Utility-driven Cross Domain Optimization.
    In Proc. IEEE Intl. Conf. Application-Specific Systems, Architectures, and Processors (ASAP), pp. 132-137.


  85. Suhendra, V., and Mitra, T. 2008.
    Exploring Locking & Partitioning for Predictable Shared Caches on Multi-Cores.
    45th Design Automation Conference (DAC), pp. 300-303.


  86. Schoeberl, M. 2007.
    A Time-Triggered Network-on-Chip.
    Intl. Conf. Field Programmable Logic and Applications (FPL), pp. 377-382.


  87. Iqbal, M.M. 2008.
    Morero Cluster of Workstations (COW): First Practical Approach towards Home Grown Supercomputers in Pakistan.
    In Proc. 2nd Intl. Conf. Electrical Engineering, pp. 1-5.


  88. Karlsson, M., and Hagersten, E. 2007.
    Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution.
    IEEE Intl. Parallel and Distributed Processing Symp., pp. 1-10.


  89. Ozturk, O., Kandemir, M., Chen, G., Irwin, M.J., and Karakoy, M. 2005.
    Customized On-Chip Memories for Embedded Chip Multiprocessors.
    In Proc. of the Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 743-748.


  90. Xu, M., Thulasiraman, P., and Thulasiram, R.K. 2008.
    Exploiting Data Locality in FFT using Indirect Swap Network on Cell/B.E.
    In Proc. 22nd Intl. Symp. on High Performance Computing Systems and Applications (HPCS), pp. 88-94.


  91. Jizhu Lu, Perrone, M., Albayraktaroglu, K., and Franklin, M. 2008.
    HMMer-Cell: High Performance Protein Profile Searching on the Cell/B.E. Processor.
    IEEE Intl. Symp. on Performance Analysis of Systems and Software, pp. 223-232.


  92. Dreslinski, R.G., Bo Zhai, Mudge, T., Blaauw, D., and Sylvester, D. 2007.
    An Energy Efficient Parallel Architecture Using Near Threshold Operation.
    16th Intl. Conf. Parallel Architecture and Compilation Techniques (PACT), pp. 175-188.


  93. Dash, A., and Petrov, P. 2006.
    Energy-Efficient Cache Coherence for Embedded Multi-Processor Systems through Application-Driven Snoop Filtering.
    9th EUROMICRO Conference on Digital System Design (DSD), pp. 79-82.


  94. Francesco, P., Antonio, P., and Marchal, P. 2005.
    Flexible Hardware/Software Support for Message Passing on a Distributed Shared Memory Architecture.
    Design, Automation and Test in Europe (DATE), pp. 736-741.


  95. Shadich, R., and McLoughlin, I.V. 2005.
    A Scalable Parallel Computational Core for Embedded Processing.
    IEEE TENCON Region 10, pp. 1-6.


  96. Pasricha, S., Young-Hwan Park, Kurdahi, F.J., and Dutt, N. 2006.
    System-Level Power-Performance Trade-Offs in Bus Matrix Communication Architecture Synthesis.
    CODES+ISSS, pp. 300-305.


  97. Tseng, J.H., Hao Yu, Nagar, S., Dubey, N., Franke, H., and Pattnaik, P. 2007.
    Performance Studies of Commercial Workloads on a Multi-core System.
    10th IEEE Intl. Symp. on Workload Characterization, pp. 57-65.


  98. Viswanath, V. 2004.
    Multi-log Processor – Towards Scalable Event-Driven Multiprocessors.
    Euromicro Symp. on Digital System Design (DSD), pp. 279-286.



Page responsible: Zebo Peng
Last updated: 2014-10-29