# COMP SCI 3306 Assignment 3: Frequent Itemsets, Clustering

Assignment 3: Frequent Itemsets, Clustering, Advertising
Formative, Weight (15%), Learning objectives (1, 2, 3),
Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)
Due date: 11:59 pm, 3 June 2019
1 Overview
Read the following carefully as it differs from the last assignment.
For students who are taking the course COMP SCI 3306 (i.e., undergraduate
students), this assignment can be done in groups of two students. If
you have problems finding a group partner, use the forum to search for one
or contact the lecturer.
For other students who are taking the course COMP SCI 7306, this assignment
should be done individually.
References to sections, examples, etc. refer to the book “Leskovec, Rajaraman
and Ullman: Mining Massive Datasets (Second Edition)”.
2 Assignment
Exercise 1 Frequent Itemsets (15+15+10+10 points)
For this exercise, you have to read Section 6.4 up to 6.4.3.
1. Implement the simple, randomized algorithm given in Section 6.4.1.
2. Implement the algorithm of Savasere, Omiecinski, and Navathe (SON algorithm)
given in Section 6.4.3.
3. Compare the two algorithms on the datasets T10I4D100K, T40I10D100K,
chess, connect, mushroom, pumsb, pumsb_star provided at
http://fimi.ua.ac.be/data/
and report the outcomes.
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2019
4. Experiment with different sample sizes in the simple randomized algorithm,
such as 1%, 2%, 5%, and 10%, and compare your results (including the result
produced by the SON algorithm).
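A possible starting point for part 1, sketched under stated assumptions rather than as a reference solution: the baskets fit in memory as Python collections, `apriori` is a deliberately plain in-memory miner, and the names `apriori`, `sample_frequent_itemsets`, and the `slack` parameter are illustrative choices that do not come from the assignment. The `slack` factor lowers the sample threshold below p·s, in the spirit of the discussion of sampling errors in Section 6.4.2.

```python
import random
from collections import Counter
from itertools import combinations


def apriori(baskets, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    baskets = [frozenset(b) for b in baskets]
    singles = Counter(item for b in baskets for item in b)
    current = {frozenset([i]): c for i, c in singles.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join frequent (k-1)-itemsets into k-itemsets and keep only those
        # whose (k-1)-subsets are all frequent (monotonicity).
        candidates = set()
        for a, b in combinations(list(prev), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = Counter()
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def sample_frequent_itemsets(baskets, support, p, slack=0.9, seed=0):
    """Mine a random sample: keep each basket with probability p and use
    the scaled threshold slack * p * support (slack is an assumption)."""
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < p]
    return apriori(sample, slack * p * support)
```

The result is only an estimate: itemsets frequent in the sample may be infrequent in the full data (false positives), and the reverse can also happen (false negatives), which is what the comparison with SON in part 4 is meant to expose.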
Your approach should be as efficient as possible in terms of runtime and
memory requirements.
Report on any challenges you observed during the implementation
and while running the experiments.
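The two passes of the SON algorithm can be sketched as follows. This is a minimal illustration, not an efficient implementation: `naive_frequent` is a brute-force stand-in for whatever in-memory miner is used per chunk, the `max_size` cap on itemset size is an assumption added to keep that stand-in tractable, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def naive_frequent(baskets, min_count, max_size=3):
    """Stand-in miner: count every itemset of up to max_size items."""
    counts = Counter()
    for basket in baskets:
        for k in range(1, max_size + 1):
            for combo in combinations(basket, k):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_count}


def son(baskets, support, num_chunks, max_size=3):
    """Return {itemset: exact count} for itemsets with count >= support."""
    baskets = [frozenset(b) for b in baskets]
    if not baskets:
        return {}
    chunk_size = -(-len(baskets) // num_chunks)  # ceiling division

    # Pass 1: mine each chunk with a proportionally scaled threshold.
    # An itemset that is frequent overall must be frequent in at least
    # one chunk, so the union of local results has no false negatives.
    candidates = set()
    for start in range(0, len(baskets), chunk_size):
        chunk = baskets[start:start + chunk_size]
        local_threshold = support * len(chunk) / len(baskets)
        candidates |= naive_frequent(chunk, local_threshold, max_size)

    # Pass 2: count the candidates over the whole data set and drop
    # the false positives.
    counts = Counter()
    for basket in baskets:
        for cand in candidates:
            if cand <= basket:
                counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= support}
```

Because pass 2 counts exactly, the output contains no false positives either, at the cost of a second full scan of the data.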
Exercise 2 Clustering (10+20 points)
1. Perform a hierarchical clustering of the one-dimensional set of points
1, 4, 9, 16, 25, 36, 49, 64, 81,
assuming the clusters are represented by their centroid (average), and at
each step the clusters with the closest centroids are merged (Exercise
7.2.1).
2. Implement the K-means algorithm and carry out experiments on the provided
Iris dataset.
a) Plot the K-means results by plotting the first 2 dimensions
of the input data together with the converged centroids.
b) Discuss how you pick the value of K in K-means.
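For part 1, the centroid-based merging rule can be traced mechanically. The sketch below is one way to do that for one-dimensional data; the function name `centroid_hierarchical` is hypothetical, and ties are broken in favour of the leftmost pair, which is an assumption since the exercise does not specify a tie rule.

```python
def centroid_hierarchical(points):
    """Repeatedly merge the two clusters with the closest centroids.

    Returns the merges in order, each as a (left_cluster, right_cluster)
    tuple of point lists.
    """
    clusters = [[p] for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        centroids = [sum(c) / len(c) for c in clusters]
        # In one dimension, with clusters kept in sorted order, the
        # closest pair of centroids is always a pair of neighbours.
        i = min(
            range(len(clusters) - 1),
            key=lambda j: centroids[j + 1] - centroids[j],
        )
        merges.append((clusters[i], clusters[i + 1]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return merges
```

Calling `centroid_hierarchical([1, 4, 9, 16, 25, 36, 49, 64, 81])` and printing each merge gives a trace that can be checked against a hand computation.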
For the Iris data, use only the first 4 dimensions for this exercise. In other
words, discard the label information.
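A minimal sketch of K-means (Lloyd's algorithm) in plain Python is given below, under these assumptions: the points are already loaded as equal-length numeric tuples, initial centroids are drawn at random from the data, and the names `kmeans` and `sse` are illustrative. Loading the Iris file and plotting are left out because the file format and plotting library are not specified here. The `sse` helper computes the within-cluster sum of squared distances, which is one common quantity to plot against K when discussing how to choose it.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) after Lloyd's iterations converge."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sse(points, centroids, labels):
    """Within-cluster sum of squared distances for a given clustering."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
```

Since the result depends on the random initial centroids, running the function with several `seed` values and keeping the clustering with the lowest `sse` is a reasonable precaution.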
Exercise 3 Advertising (Exercise 8.4.1) (10+10 points)
Consider Example 8.7. Suppose that there are three advertisers A, B, and
C. There are three queries x, y, and z. Each advertiser has a budget of 2.
Advertiser A only bids on x, B bids on x and y, and C bids on x, y, and z. Note
that on the query sequence xxyyzz, the optimal offline algorithm would yield a
revenue of 6, since all queries can be assigned.
1. Show that the greedy algorithm will assign at least 4 of the 6 queries
xxyyzz.
2. Find another sequence of queries such that the greedy algorithm can assign
as few as half the queries that the optimal offline algorithm would assign
to that sequence.
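To experiment with query sequences before writing the argument, the greedy algorithm can be simulated over all of its possible tie-breaks. The sketch below assumes, as in Example 8.7, that every bid is worth 1 and that greedy assigns a query to any advertiser who bids on it and still has budget; the function name `greedy_revenues` and the `BIDS` table layout are hypothetical. It enumerates outcomes exhaustively, so it is only suitable for short sequences, and it supports a proof rather than replacing one.

```python
def greedy_revenues(queries, bids, budget):
    """Return the set of revenues greedy can reach over all tie-breaks."""
    outcomes = set()

    def explore(pos, remaining, revenue):
        if pos == len(queries):
            outcomes.add(revenue)
            return
        eligible = [
            a for a in sorted(bids)
            if queries[pos] in bids[a] and remaining[a] > 0
        ]
        if not eligible:
            # No bidder with budget left: the query goes unassigned.
            explore(pos + 1, remaining, revenue)
            return
        for a in eligible:
            nxt = dict(remaining)
            nxt[a] -= 1
            explore(pos + 1, nxt, revenue + 1)

    explore(0, {a: budget for a in bids}, 0)
    return outcomes


# Advertisers and the queries they bid on, as stated in the exercise.
BIDS = {"A": {"x"}, "B": {"x", "y"}, "C": {"x", "y", "z"}}
```

For example, `min(greedy_revenues("xxyyzz", BIDS, 2))` gives the worst case for the sequence in part 1, and the same call with other sequences can help in the search for part 2.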
3 Procedure for handing in the assignment
Work should be handed in using Canvas. The submission should include:
- A PDF file of your solutions for the theoretical exercises. The solutions
should contain a detailed description of how the result was obtained.
For Exercise 2.2, comment your code properly to
show your understanding.
- All source files and all project files.
- A README.txt file containing instructions to run the code, along with the names,
student numbers, and email addresses of the group members.
