COMP SCI 3306 Assignment 3: Frequent Itemsets, Clustering

Assignment 3: Frequent Itemsets, Clustering, Advertising
Formative, Weight (15%), Learning objectives (1, 2, 3),
Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)
Due date: 11:59pm, 3 June 2019
1 Overview
Read the following carefully as it differs from the last assignment.
For students who are taking the course COMP SCI 3306 (i.e., undergraduate
students), this assignment can be done in groups of two students. If you have
problems finding a group partner, use the forum to search for one or contact
the lecturer.
For other students who are taking the course COMP SCI 7306, this assignment
should be done individually.
References to sections, examples, etc. refer to the book “Mining of Massive
Datasets (Second Edition)” by Leskovec, Rajaraman and Ullman.
2 Assignment
Exercise 1 Frequent Itemsets (15+15+10+10 points)
For this exercise, you have to read Section 6.4 up to 6.4.3.
1. Implement the simple, randomized algorithm given in 6.4.1
2. Implement the algorithm of Savasere, Omiecinski, and Navathe (SON algorithm)
in 6.4.3
3. Compare the two algorithms on the datasets T10I4D100K, T40I10D100K,
chess, connect, mushroom, pumsb, pumsb_star provided at
http://fimi.ua.ac.be/data/
and report the outcomes.
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2019
4. Experiment with different sample sizes in the simple randomized algorithm,
such as 1%, 2%, 5%, and 10%, and compare your results (including the result
produced by the SON algorithm).
Your approach should be as efficient as possible in terms of runtime and
memory requirements.
Report on any challenges you encountered in the implementation and while
running the experiments.
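
As a rough illustration of items 1 and 2 above, here is a minimal Python sketch, not a reference solution. It assumes all baskets fit in memory as sets of items and that the support threshold is an absolute count. It uses a plain Apriori pass as the in-memory miner, which is an assumption, since any main-memory algorithm could be substituted. The names apriori, simple_randomized, and son are illustrative only, and the code is not tuned for runtime or memory.

```python
import random
from collections import Counter
from itertools import combinations


def apriori(baskets, min_count):
    """Return {itemset: count} for all itemsets with count >= min_count."""
    baskets = [frozenset(b) for b in baskets]
    singles = Counter(item for b in baskets for item in b)
    current = {frozenset([i]): c for i, c in singles.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        items = sorted({i for s in current for i in s})
        # Keep a size-k candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def simple_randomized(baskets, support, p, seed=0):
    """Mine a random sample (each basket kept with probability p),
    scaling the threshold to p * support. The result may contain
    false positives and false negatives."""
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < p]
    return set(apriori(sample, max(1, p * support)))


def son(baskets, support, n_chunks):
    """Two-pass SON: per-chunk candidates, then an exact count over all baskets."""
    baskets = [frozenset(b) for b in baskets]
    size = max(1, -(-len(baskets) // n_chunks))  # ceiling division
    candidates = set()
    for start in range(0, len(baskets), size):
        chunk = baskets[start:start + size]
        # Threshold scaled by the fraction of baskets in this chunk.
        local = max(1, support * len(chunk) / len(baskets))
        candidates |= set(apriori(chunk, local))
    # Second pass removes false positives by counting over the whole input.
    counts = Counter(c for b in baskets for c in candidates if c <= b)
    return {c for c, n in counts.items() if n >= support}
```

Because SON counts every candidate over the full input, its output can serve as the exact baseline when measuring the false positives and false negatives of the sampling approach at different values of p.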
Exercise 2 Clustering (10+20 points)
1. Perform a hierarchical clustering on the one-dimensional set of points
1, 4, 9, 16, 25, 36, 49, 64, 81.
assuming the clusters are represented by their centroid (average), and at
each step the clusters with the closest centroids are merged. (Exercise
7.2.1)
2. Implement the K-means algorithm and carry out experiments on the provided
Iris dataset.
a) You are asked to plot the K-means results by plotting the first 2 dimensions
of the input data as well as the converged centroids.
b) Discuss how you picked the value of K in K-means.
For the Iris data, only use the first 4 dimensions for this exercise. In other
words, discard the label information.
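
The two parts of this exercise can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference solution: centroid_hierarchical and kmeans are illustrative names, ties between equally close centroids are broken in favour of the leftmost pair, K-means is initialised from K randomly chosen data points, and points are tuples of floats. Loading the Iris file and plotting are left out.

```python
import random


def centroid_hierarchical(points):
    """Merge 1-D clusters by closest centroids; return the merged clusters in order."""
    clusters = [[p] for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        cents = [sum(c) / len(c) for c in clusters]
        # In one dimension the closest pair of centroids is always adjacent.
        i = min(range(len(clusters) - 1), key=lambda j: cents[j + 1] - cents[j])
        merged = clusters[i] + clusters[i + 1]
        merges.append(merged)
        clusters[i:i + 2] = [merged]
    return merges


def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm; returns (centroids, groups) from the last assignment step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(max_iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the centroid with the smallest squared distance.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids, groups


merges = centroid_hierarchical([1, 4, 9, 16, 25, 36, 49, 64, 81])
```

For the Iris data, each point would be a tuple holding the first four columns of a row. Running kmeans for several values of K and comparing the total within-cluster squared distance is one common way to inform the choice of K.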
Exercise 3 Advertising (Exercise 8.4.1) (10+10 points)
Consider Example 8.7. Suppose that there are three advertisers A, B, and
C. There are three queries x, y, and z. Each advertiser has a budget of 2.
Advertiser A only bids on x, B bids on x and y, and C bids on x, y, and z. Note
that on the query sequence xxyyzz, the optimal offline algorithm would yield a
revenue of 6, since all queries can be assigned.
1. Show that the greedy algorithm will assign at least 4 of the 6 queries
xxyyzz.
2. Find another sequence of queries such that the greedy algorithm can assign
as few as half the queries that the optimal offline algorithm would assign
to that sequence.
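
Candidate answers to both parts can be checked by brute force. The sketch below is an illustration, not part of the required written argument. It assumes every bid is worth 1, so revenue equals the number of assigned queries. It also assumes greedy must assign a query whenever some bidder on it still has budget, with ties broken adversarially. The names BIDS and assigned are hypothetical.

```python
# Which queries each advertiser bids on, taken from the exercise statement.
BIDS = {"A": {"x"}, "B": {"x", "y"}, "C": {"x", "y", "z"}}


def assigned(queries, bids=BIDS, budget=2, adversarial=True):
    """Number of queries assigned.

    adversarial=True : greedy with the worst possible tie-breaking.
    adversarial=False: optimal offline assignment (exhaustive search).
    """
    names = sorted(bids)

    def rec(i, left):
        if i == len(queries):
            return 0
        # One option per advertiser that bids on this query and has budget left.
        options = [
            1 + rec(i + 1, left[:k] + (left[k] - 1,) + left[k + 1:])
            for k, name in enumerate(names)
            if queries[i] in bids[name] and left[k] > 0
        ]
        if adversarial:
            # Greedy cannot skip an assignable query; the adversary picks the tie.
            return min(options) if options else rec(i + 1, left)
        # Offline optimum may also leave a query unassigned.
        return max(options + [rec(i + 1, left)])

    return rec(0, (budget,) * len(names))


worst_greedy = assigned("xxyyzz", adversarial=True)
optimum = assigned("xxyyzz", adversarial=False)
```

Calling assigned on other query strings with both settings of adversarial shows how far the worst-case greedy result can fall below the offline optimum for that sequence.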
3 Procedure for handing in the assignment
Work should be handed in using Canvas. The submission should include:
- a PDF file of your solutions for theoretical assignments. The solutions
  should consist of a detailed description of how to obtain the result.
  For Exercise 2.2, you should comment your code properly to show your
  understanding.
- all source files and all project files.
- a README.txt file containing instructions to run the code, and the names,
  student numbers, and email addresses of the group members.
