732A54 and TDDE31 Big Data Analytics
Course information
Course literature
- Articles, chapters
- Lab assignment descriptions
- Help for Python
- Interactive Python web tutorial
- Python code visualization
- Python tutorial
- Codecademy
- Infographic on R vs Python
- Cheat sheet
- Python's assignment (or "binding") model
- ONLY 732A54: Relational databases
- Ramez Elmasri and Shamkant B Navathe, Fundamentals of Database Systems, 7th edition, 2016: chapters 3-6 and 9, section 7.1.
- SQL tutorial
- NoSQL and NewSQL (recommended reading)
- Elmasri and Navathe: Fundamentals of Database Systems, 7th edition, 2016: Chapter 24; Sections 20.1-20.3 and 23.1-23.4.
- Cattell: Scalable SQL and NoSQL data stores. ACM SIGMOD Record 2010, pages 12-27.
- Stonebraker: NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps. Communications of the ACM Blog. 2011.
- Pavlo and Aslett: What's Really New with NewSQL? ACM SIGMOD Record 2016.
- Grolinger et al.: Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing, 2013.
- Brewer: Towards Robust Distributed Systems. Keynote talk at ACM PODC 2000.
- Vogels: Eventually Consistent. Communications of the ACM 2009, pages 40-44.
- Stonebraker et al.: "One Size Fits All": An Idea Whose Time Has Come and Gone. ICDE 2005, pages 2-11.
- Parallel processing (recommended reading)
- C. Lin, L. Snyder: Principles of Parallel Programming. Pearson/Addison Wesley, 2008. 978-0-321-54942.
- MapReduce and Hadoop (recommended reading)
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. Proc. OSDI, ACM, 2004. (The journal version, CACM 2008, is listed under 'Machine learning' on this page.)
- Apache Hadoop: https://hadoop.apache.org
- Donald Miner and Adam Shook: MapReduce Design Patterns. O'Reilly, 2012.
- Spark (recommended reading)
- Matei Zaharia et al.: Spark: Cluster Computing with Working Sets. Proc. HotCloud'10, USENIX, 2010.
- Apache Spark: http://spark.apache.org
- A. Nandi: Spark for Python Developers. Packt Publishing, 2015.
- SparkSQL (recommended reading)
- Resource management in big-data clusters (recommended reading)
- Vinod Kumar Vavilapalli et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. Proc. SoCC'13, ACM, 2013.
- Apache Hadoop YARN: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
- Benjamin Hindman et al.: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Proc. NSDI'11, USENIX, 2011.
- Apache Mesos: http://mesos.apache.org/
- HDFS
- (recommended reading) Shvachko et al.: The Hadoop Distributed File System. IEEE MSST 2010, pages 1-10.
- (optional) White: Hadoop: The Definitive Guide, Chapter: The Hadoop Distributed File System. 2011.
- Machine learning (recommended reading)
- Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
- Chu, C.-T. et al. Map-Reduce for Machine Learning on Multicore. NIPS 19, 281-288, 2006.
- Zaharia, M. et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 12, 15-28, 2012.
- Meng, X. et al. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34):1-7, 2016.
Page responsible: Olaf Hartig
Last updated: 2021-04-08