Review form for CSEM-reviews   2003
====================================

Paper #5, "Local Scheduling Technique for Memory Coherence in a Clustered VLIW
    Processor with a Distributed Data Cache" by Enric Gibert, Jesus Sanchez,
    Antonio Gonzalez

Reviewer: Andrzej Bednarski

Short summary
-------------

The paper presents a software solution to the problem of data inconsistency
in homogeneous clustered VLIW architectures with a hierarchical memory
model. The authors present two static scheduling algorithms (MDC and DDGT)
that ensure data consistency of the resulting code. Both algorithms are
evaluated and analyzed using the Mediabench benchmark suite. The authors also
compare the benefits of the two approaches.

The goal is to increase the local hit ratio while keeping execution time low.
To decrease the number of remote hits (i.e. improve the local hit ratio), one
solution is to add attraction buffers (a hardware solution) to the
architecture. This implies a modification of the initial algorithms (provided
by the authors) to cope with the additional copies of data, which introduce
new possible inconsistencies.

The main contributions
----------------------

The main contribution is a new scheduling algorithm for clustered
architectures with a distributed data cache that does not require hardware
support. The evaluation of two different strategies shows a decrease in the
number of memory stalls. Further, the authors analyze the performance
bottlenecks and try to characterize the situations for which each proposed
algorithm is more suitable; a full characterization is left for future work.

Merits and weaknesses
---------------------

Merits:
+ Addresses loops, which are central in classical DSP programs.
+ Optimizing technique: improves execution time and, similarly, power, by
  increasing the local hit ratio.
+ A software solution.

Weaknesses:
- The paper is not self-contained. The reader needs to browse the authors'
  previous publication for a deeper understanding. This applies, for
  instance, to the profiling information that is crucial to the MDC strategy
  for placing instructions on specific clusters.
- In the results section, the authors compare the two methods only relative
  to each other. There is no baseline comparison that would show the actual
  improvement/degradation of the resulting code.
- Attraction buffers appear to be a sort of patch to improve the numerical
  results. Such improvements are not realizable with existing architectures.

Numerical rating
----------------

* Significance: 7
* Originality: 8
* Interest to a journal on programming languages and compiler technology: 8
* Quality of experimental evaluation: 8
* Overall organization: 8
* Presentation (language and style): 7
* Length appropriate: 8
* References appropriate: 8

* Overall evaluation: 8
* Recommendation: Accept
* Your confidence in your review: 6

Comments to the authors
-----------------------

The paper is well organized and entertaining. I especially liked the pseudo
code, which helps significantly in understanding the authors' approach.

Both algorithms operate on the intermediate representation by adding edges to
the graph. The authors should mention which scheduling strategy they use.
Adding edges to the DDG decreases the freedom of the scheduler, since it
limits the number of valid schedules that respect both data dependencies and
memory coherence.
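
As a minimal illustration of this point (not taken from the paper), the sketch
below counts the sequential orders allowed by a small, hypothetical four-node
DDG before and after one extra coherence edge is added. The node names, the
edges, and the helper count_valid_orders are assumptions made for this
example only; a real VLIW scheduler also considers parallel issue, latencies
and cluster assignment, which are ignored here.

```python
from itertools import permutations


def count_valid_orders(nodes, edges):
    """Count linear orders of `nodes` in which u precedes v for every edge (u, v)."""
    count = 0
    for order in permutations(nodes):
        position = {node: index for index, node in enumerate(order)}
        if all(position[u] < position[v] for u, v in edges):
            count += 1
    return count


# Hypothetical DDG: two loads feed an add; a store is initially unconstrained.
nodes = ["ld1", "st1", "ld2", "add"]
data_edges = [("ld1", "add"), ("ld2", "add")]

# Assumed extra edge forcing the store to precede the second load.
coherence_edges = [("st1", "ld2")]

before = count_valid_orders(nodes, data_edges)
after = count_valid_orders(nodes, data_edges + coherence_edges)
print(before, after)  # prints "8 3"
```

In this toy case a single added edge removes five of the eight admissible
orders, which is why the choice of scheduling strategy matters for the
reported results.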

Suggestions for improvement
---------------------------

The last part feels like a "patch" for improving performance. The authors
opt for a hardware solution to decrease the number of remote hits. It improves
the results, but unfortunately does not add merit to the authors' work.
The authors propose an adapted version of the algorithm (DDGT) for
architectures with attraction buffers.

Pseudo code for MDC would be appreciated as well.

The authors do not explain how they perform application profiling (the
reader needs to browse the previous work). The same holds for the
heuristics, which are only named in this paper.

Minor remark: for readability, Figure 5 should include the delays, as in Figure 3.