General Circulation(Fall 2020) and 高级算法 (Fall 2020)/Problem Set 2: Difference between pages

From TCS Wiki
(Difference between pages)
Jump to navigation Jump to search
imported>Etone
 
imported>TCSseminar
 
Line 1: Line 1:
{{Infobox
*每道题目的解答都要有<font color="red" size=5>完整的解题过程</font>。中英文不限。
|name        = Infobox
|bodystyle    =  
|title        = 大气环流 <br>
General Circulation of the Atmosphere
|titlestyle  =


|image        = [[File:Yang_2014_3.jpg|border|150px]]
== Problem 1 ==
|imagestyle  =  
Suppose we want to estimate the value of <math>Z</math>. Let  <math>\mathcal{A}</math> be an algorithm that outputs <math>\widehat{Z}</math> satisfying
|caption      =  
<math>\Pr[ (1-\epsilon)Z \leq \widehat{Z} \leq (1+\epsilon )Z] \geq \frac{3}{4} .</math>
|captionstyle =  
|headerstyle = background:#ccf;
|labelstyle  = background:#ddf;
|datastyle    =


|header1 =Instructor
We run <math>\mathcal{A}</math> independently for <math>s</math> times, and obtain the outputs <math>\widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s</math>.
|label1 =
|data1  =
|header2 =
|label2  =
|data2  = 张洋
|header3 =
|label3  = Email
|data3  = yangzhang@nju.edu.cn
|header4 =
|label4= office
|data4= 仙林大气楼 B410
|header5 = Class
|label5  =
|data5  =
|header6 =
|label6  = Class meetings
|data6  = 周三 下午 1:00-4:00,仙林 逸B-307
|header7 =
|label7  = Place
|data7  =
|header8 =
|label8  = Office hours
|data8  = 周三 下午3:50-4:10 <br> 仙林 逸B-307
|header9 = Reference book
|label9  =
|data9  =
|header10 =
|label10  =
|data10  = {{Infobox
|name        =
|bodystyle  =
|title        =
|titlestyle  =
|image        = [[File:James-Circulating.jpg|border|100px]]
|imagestyle  =
|caption      = Introduction to Circulating Atmospheres, <br>''I. James'', Cambridge Press, 1995
|captionstyle = }}
|header11 =
|label11  =
|data11  = {{Infobox
|name        =
|bodystyle  =
|title        =
|titlestyle  =
|image        = [[File:Oort.jpg|border|100px]]
|imagestyle  =
|caption      =Physics of Climate, ''Peixoto, J. P.'' and ''A. H. Oort'',  Springer-Verlag New York, 1992
|captionstyle =
}}
|header12 =
|label12  =
|data12  = {{Infobox
|name        =
|bodystyle  =
|title        =
|titlestyle  =
|image        = [[File:geoff.jpg|border|100px]]
|imagestyle  =
|caption      =Atmospheric and Oceanic Fluid Dynamics: Fundamentals and Large-scale Circulation, ''Vallis, G. K.'', Cambridge University Press, 2006
|captionstyle =
}}
|belowstyle = background:#ddf;
|below =
}}


Let <math>X</math> be the '''median''' (中位数) of <math>\widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s</math>. Find the number <math>s</math> such that <math>\Pr[ (1-\epsilon)Z \leq X \leq (1+\epsilon )Z] \geq 1 - \delta .</math>


This is the page for the class ''General Circulation of the Atmosphere (大气环流)'' for the Fall 2020 semester. Students who take this class should check this page periodically for content updates and new announcements.
Express <math>s</math> as a function of <math>\delta</math>. Make <math>s</math> as small as possible.


= Announcement =
'''Remark''': in this problem, we boost the probability of success from <math>\frac{3}{4}</math> to <math>1-\delta</math>. This method is called the median trick.
* 本学期大气环流课的上课时间更改至  <font color="red" size="2>周三下午 1:00-4:00,逸 B307 </font>。【2020.10.22】
* 本学期大气环流课期末考试的时间为  <font color="red" size="2>12月30日(周三)下午 1:00-4:00,逸 B306 </font>。【2020.12.22】


= Course info =
'''Hint''': Chernoff bound.
* '''Instructor ''': 张洋,
== Problem 2 ==
:*office: 仙林气象楼 B410
In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an <math>N</math> dimensional vector <math>x</math>, we want to process a stream of arbitrary increments to entries in <math>x</math>. In other words, if we see a number <math>i\in 1,\dots,N</math> in the stream, we update entry <math>x_i\gets x_i + 1</math>. Our goal is to estimate <math>\left \|x\right \|_0</math>, which measures the number of non-zero entries in <math>x</math>. With <math>x</math> viewed as a histogram that maintains counts for <math>N</math> potential elements, <math>\left \|x\right \|_0</math> is exactly the number of distinct elements processed.
:*email: yangzhang@nju.edu.cn
In this problem we will develop an alternative algorithm for estimating <math>\left \|x\right \|_0</math> that can also handle '''decrements''' to entries in x. Specifically, instead of the stream containing just indices <math>i</math>, it contains pairs <math>(i, +)</math> and <math>(i, −)</math>. On receiving <math>(i, +)</math>, <math>x</math> should update so that <math>x_i\gets x_i + 1</math> and on receiving <math>(i, −)</math>, <math>x</math> should update so that <math>x_i\gets x_i - 1</math>. For this problem we will assume that, at the end of our stream, each <math>x_i \ge 0</math> (i.e. for a specific index we can’t receive more decrements than increments).
* '''Class meeting''': 周三 下午 1:00-4:00, 仙林 逸B-307
* Consider a simpler problem. For a given value <math>T</math>, let’s design an algorithm that succeeds with probability <math>(1 − \delta)</math>, outputing '''LOW''' if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and '''HIGH''' if <math>T > 2\left \|x\right \|_0</math>:
* '''Office hour''': 周三 <font color="red" size="2>下午3:50-4:10</font>,仙林 逸B-307
** Assume we have access to a completely random hash function <math>h(\cdot)</math> that maps each <math>i</math> to a random point in <math>[0, 1]</math>. We maintain the estimator <math>s=\sum_{i:h(i)<\frac{1}{2T}}x_i</math> as we receive increment and decrement updates. Show that, at the end of our stream,<br>(i) If <math>T < \frac{1}{2}\left \|x\right \|_0</math>, <math>Pr_h[s=0]<1/e\approx 0.37</math>;<br>(ii) If <math>T > 2\left \|x\right \|_0</math>, <math>Pr_h[s=0]>0.5</math>.
* '''Prerequisites''': 动力气象,天气学,气候学
** Using this fact, show how to use <math>k=O(\log 1/\delta)</math> independent random hash functions, and corresponding individual estimators <math>s_1, s_2, . . . , s_k</math>, to output '''LOW''' if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and '''HIGH''' if <math>T > 2\left \|x\right \|_0</math>. If neither event occurs you can output either '''LOW''' or '''HIGH'''. Your algorithm should succeed with probability <math>(1 − \delta)</math>.
* '''Grading''': 平时作业(50%)+ 期末考试(50%)
* Using <math>O(\log N)</math> repetitions of your algorithm for the above decision problem (with <math>\delta</math> set appropriately), show how to obtain an estimate <math>F</math> for <math>\left \|x\right \|_0</math> such that <math>\frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0</math> w.h.p.(with probability <math>1-O(1/N)</math>).
本课程将大致布置4次作业,每次作业一二道题目左右。题目将选择每个课题最具有代表性、需要一定思维强度和动手能力的训练用题目,意在使学生通过顺利完成作业来建立环流系统的物理模型、以对课程内容得到深刻全面地理解和掌握。期末考试题目数量将会比平时作业多,覆盖面更广,但会比作业题目简单,只涉及对基本内容的掌握和对环流理论的直接应用。


= Course Slides =
== Problem 3 ==
''Many figures in the course slides are adapted from the reference books, NOAA and other educational sources. These figures are <font color="red" size="2">for class use only.</font>''
Let <math>U</math> be a universal set.
We use <math>2^U</math> to denote the collection of all subsets of <math>U</math>.
Let <math>\mathcal{F}</math> be a family of hash functions, in which each hash function <math>h:2^U \rightarrow \{0,1\}^m</math> maps subsets of <math>U</math> to 0-1 vectors of length <math>m</math>.
A locality sensitive hashing scheme is a distribution on a family <math>\mathcal{F}</math> of hash functions operating on <math>2^U</math>, such that for two subsets <math>A,B \in 2^U</math>,


*[2020.10.14]  [http://tcs.nju.edu.cn/yzhang/Course_intro_2020.pdf Course Introduction] and [http://tcs.nju.edu.cn/yzhang/Chap_1_1_2020.pdf Chapter 1_1]
<math>(1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B).</math>


*[2020.10.21[http://tcs.nju.edu.cn/yzhang/Chap_1_2_2020.pdf Chapter 1_2] and [http://tcs.nju.edu.cn/yzhang/Chap_2_1_2020.pdf Chapter 2_1] (reference reading: PO Chapter 5.1-5.2; James Chapter 2.2, 2.4)
Here <math>sim:2^U \times 2^U \rightarrow [0,1] </math> is called the similarity function. Given a hash function family <math> \mathcal{F} </math> that satisfies Equation (1), we will say that <math>\mathcal{F}</math> is a locality sensitive hash function family corresponding to similarity function <math>sim(\cdot,\cdot)</math>.


*[2020.10.28] [http://tcs.nju.edu.cn/yzhang/Chap_2_2_2020.pdf Chapter 2_2] and [http://tcs.nju.edu.cn/yzhang/Chap_2_3_2020.pdf Chapter 2_3](reference reading: PO Chapter 4.1, 6.3, 6.7-6.8; Lindzen 2005, [http://tcs.nju.edu.cn/yzhang/lINDZEN_CHAP2.pdf Chapter 2]; Lindzen and Farrell 1977, [http://tcs.nju.edu.cn/yzhang/Lindzen_1977.pdf JAS])
*For any similarity function <math> sim(A,B)</math> that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function <math> d(A,B) \triangleq 1-sim(A,B)</math>  satisfies triangle inequality, formally,


*[2020.11.4]  [http://tcs.nju.edu.cn/yzhang/Chap_3_1_2020.pdf Chapter 3_1] (reference reading: Vallis Chapter 11.1-11.2; Lindzen 2005, [http://tcs.nju.edu.cn/yzhang/Lindzen_chap7.pdf Chapter 7]; Held and Hou, 1980, [http://tcs.nju.edu.cn/yzhang/held_hou_1980.pdf JAS])
<math> \forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C).</math>


*[2020.11.11]  [http://tcs.nju.edu.cn/yzhang/Chap_3_2_2020.pdf Chapter 3_2]  (reference reading: Vallis Chapter 11.1-11.2, 11.4; Lindzen 2005, [http://tcs.nju.edu.cn/yzhang/Lindzen_chap7.pdf Chapter 7]; Held and Hou, 1980, [http://tcs.nju.edu.cn/yzhang/held_hou_1980.pdf JAS];Lindzen and Hou, 1988, [http://tcs.nju.edu.cn/yzhang/lindzen_hou_1988.pdf JAS])
*Show that there is no locality sensitive hash function family corresponding to Dice's and the Overlap coefficient. Dice's coefficient is defined as:  


*[2020.11.18]  [http://tcs.nju.edu.cn/yzhang/Chap_3_3_2020.pdf Chapter 3_3] and [http://tcs.nju.edu.cn/yzhang/Chap_4_1_2020.pdf Chapter 4_1] (reference reading: Vallis Chapter 11.4-11.7)
<math>sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}.</math>


*[2020.11.25]  [http://tcs.nju.edu.cn/yzhang/Chap_4_2_2020.pdf Chapter 4_2] (reference reading: Vallis Chapter 6.4-6.7; [http://tcs.nju.edu.cn/yzhang/Charney47.pdf Charney 1947]; [http://tcs.nju.edu.cn/yzhang/Eady49.pdf Eady 1949])
Overlap coefficient is defined as:


*[2020.12.2]  [http://tcs.nju.edu.cn/yzhang/Chap_4_3_2020.pdf Chapter 4_3] and [http://tcs.nju.edu.cn/yzhang/Chap_5_1_1_2020.pdf Chapter 5_1_1] (reference reading: Vallis Chapter 7.1-7.3, 11.4-11.7, 12.1; PO Chapter 14.1-14.5; [http://tcs.nju.edu.cn/yzhang/EP_Edmon.pdf Edmon 1980])
<math>sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}.</math>


*[2020.12.9]  [http://tcs.nju.edu.cn/yzhang/Chap_5_1_2_2020.pdf Chapter 5_1_2], [http://tcs.nju.edu.cn/yzhang/Chap_5_2_2020.pdf Chapter 5_2] and [http://tcs.nju.edu.cn/yzhang/Chap_5_3_2020.pdf Chapter 5_3]
<font color = "blue">Hint: use the triangle inequality result. </font>


*[2020.12.16] [http://tcs.nju.edu.cn/yzhang/Chap_6_2020.pdf Chapter 6] and [http://tcs.nju.edu.cn/yzhang/Chap_7_2020.pdf Chapter 7] (reference reading:PO Chapter 12.2-12.4, Chapter 13)
* Construct a collection of hash function <math>\mathcal{B}</math> where <math>f : \{0,1\}^m \rightarrow \{0,1\}</math> for each <math>f \in \mathcal{B}</math>, together with a probability distribution on <math>\mathcal{B}</math> such that
 
<math>\forall x, y \in \{0,1\}^m:\quad
  \Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases}
1 &\text{ if } x= y;\\
\frac{1}{2} &\text{ if } x \neq y.
\end{cases}
</math>


*[2020.12.23]  [http://tcs.nju.edu.cn/yzhang/Chap_8_2020.pdf Chapter 8] and [http://tcs.nju.edu.cn/yzhang/Review_2020.pdf Review]
Then use the hash function family <math>\mathcal{B}</math> to prove the following result.
 
Given a locality sensitive hash function family <math>\mathcal{F}</math> (<math>h:2^U \rightarrow \{0,1\}^m</math> for each <math>h \in \mathcal{F}</math>) corresponding to a similarity function <math>sim(A,B)</math>, we can obtain a locality sensitive hash function <math>\mathcal{F}'</math> (<math>h':2^U \rightarrow \{0,1\}</math> for each <math>h' \in \mathcal{F}'</math>) corresponding to the similarity function <math>\frac{1+sim(A,B)}{2}</math>.
= Assignments =
== Problem 4 ==
# [[Assignment 1, Fall 2020|Reanalysis data and the earth's climatology]] [Due:2020.11.11] <font color="red" size="1.5">请在本次作业的截止日期前将本次作业的图片部分和文字部分打包作为附件发送至邮箱 circulation_nju@126.com </font> [[Namelist_Assignment1_2020|上交作业人员名单]]
Let <math>G(V,E)</math> be an undirected graph with positive edge weights <math>w:E\to\mathbb{Z}^+</math>. Given a partition of <math>V</math> into <math>k</math> disjoint subsets <math>S_1,S_2,\ldots,S_k</math>, we define
# [[Assignment 2, Fall 2020|Simple energy balance climate model]] [Due:2020.11.25] [[Namelist_Assignment2_2020|上交作业人员名单]]
:<math>w(S_1,S_2,\ldots,S_k)=\sum_{uv\in E\atop \exists i\neq j: u\in S_i,v\in S_j}w(uv)</math>
# [[Assignment 3, Fall 2020|Hadley Cell]] [Due:2020.12.9] [[Namelist_Assignment3_2020|上交作业人员名单]]
as the cost of the '''<math>k</math>-cut''' <math>\{S_1,S_2,\ldots,S_k\}</math>. Our goal is to find a <math>k</math>-cut with maximum cost.  
# [[Assignment 4, Fall 2020|A generalized E-P flux ]] ([http://tcs.nju.edu.cn/yzhang/Chap_5_vq_ver.pdf eddy对水汽输送的空间分布图]) [Due:2020.12.16] [[Namelist_Assignment4_2020|上交作业人员名单]]
# Give a poly-time greedy algorithm for finding the weighted max <math>k</math>-cut. Prove that the approximation ratio is <math>(1-1/k)</math>.
 
# Consider the following local search algorithm for the weighted max cut ('''max 2-cut''').
= Course intro =
::Fill in the blank parenthesis. Give an analysis of the running time of the algorithm. And prove that the approximation ratio is 0.5.
“大气环流”常指地球大气较大空间范围、较长时间尺度上的空气流动,及其对地球大气热量、动量、能量和水汽的全球输送。虽然从十七、十八世纪起人们就开始研究大尺度的大气运动(如Hadley在1735年提出的信风理论),但大气环流真正发展成为一门较完备的学科方向却是近半个世纪的事情。随着四五十年代探空资料等高空气象要素的取得,以及六十年代卫星等覆盖全球的观测资料的加入,大气环流的空间结构和时间变化开始被系统、全面地揭示。与此同时,大气环流的数值模拟,也开始成为研究大气环流的一个主要方法,并发展至今成为了解和预估未来气候变化的主要手段。随着观测和模拟手段的进步,大气环流的理论研究也在近三十年开始快速地发展,人们对各种环流系统的维持和变化有了更全面、更深刻、也更为现代的理解。
<center>
 
{| class="wikitable"
现代的大气环流是大气动力学、天气学和气候学相结合的产物。大气环流,既是各种天气现象产生的背景流场,又是各种气候状态形成的动力机制。大气环流在低频、季节、年际、年代际等时间尺度的变化,不但会引起天气现象的变化,也影响着气候状态的形成。而在大气科学领域面临着诸如全球暖化、气候变化、环流异常等重大科学问题的今天,大气环流研究的重要性被推到了前所未有的高度,大气环流也成为活跃发展又充满挑战的学科方向。
| start with an arbitrary bipartition of <math>V</math> into disjoint <math>S_0,S_1</math>;
 
while (true) do
本课程将讲述在过去几十年里大气环流在观测、理论和模拟上取得的进展。希望学生借此课程能熟悉大气环流的基本分布和形态,掌握各主要环流系统的维持和变化机制,建立各环流系统形成的物理模型,了解现阶段的大气环流模式,知道大气环流方向有待解决的科学问题。
:if <math>\exists i\in\{0,1\}</math> and <math>v\in S_i</math> such that <font color=red>(______________)</font>
 
::then <math>v</math> leaves <math>S_i</math> and joins <math>S_{1-i}</math>;
作为一门课程,大气环流内容的讲述常可以有两条线索。一条是全球尺度上大气热量、动量、能量和水汽的分布与输送,Lorenz(1967)和 Peixoto and Oort(1992)是按此线索介绍大气环流的优秀教材;另一条线索,是各纬度、各区域内大气环流系统的形成、维持和变化机制,James(1995)和 Vallis(2006)是按此线索介绍大气环流的经典讲义。根据现阶段大气环流方向的研究特点,本课程的讲述将主要按照后一种方式来展开,并辅以介绍各环流系统对大气各要素场的输送。在介绍各环流系统时,本课程将以观测、理论和模拟为顺序,从各大气环流系统的观测事实入手,介绍大气环流系统的分布特征和时空变化特征;着重介绍关于环流系统的各种动力学模型和现阶段对环流系统的理解;辅以对环流系统模拟研究的介绍;最后通过三者的对比,讨论各环流系统有待研究的问题。
::continue;
 
:end if
= Syllabus =
:break;
本课程具体的内容安排如下:第一章为大气环流的概述,介绍大气环流发展的历史、包含的内容以及大气环流研究的常用观测资料和分析方法。第二章介绍大气环流产生的外部强迫:辐射强迫和下界面过程。第三至六章介绍大气环流中的各个环流系统及它们的动力机制。第七章详细介绍各复杂度的大气环流模式。第八章介绍大气环流领域现阶段最大的一个开放课题:全球暖化背景下的大气环流。这一章既是对前几章所介绍的大气环流理论的应用与检验,又是对未来大气环流研究方向的探讨。借此让学生熟悉并理解大气环流领域亟需解决的课题。具体课程安排和参考书目如下。
end
== Course schedule ==
|}
*大气环流概述 (Introduction) (4课时)
</center>
*大气环流的外部强迫(3课时)
**辐射强迫 (Radiative forcing)
**下界面过程 (Surface boundaries)
*经向环流系统 (Zonally-averaged circulations)
**Hadley 环流(4课时)
**Ferrel 环流,急流,中纬度的波流相互作用(8课时)
*纬向环流系统(Non-zonal circulations)(6课时)
**Storm tracks
**Monsoon
**ENSO and Walker circulation
*能量和水汽循环 (Angular momentum, energy and water vapor)(2课时)
*不同复杂度的大气环流模式 (General circulation in a hierarchy of models)(2课时)
*全球暖化背景下的大气环流 (General circulation in the global warming scenario)(2课时)
 
[[点击此处看详细课程安排 (click for more)]]
 
== References ==
*观测部分:Peixoto, J. P. and A. H. Oort, 1992: Physics of Climate. Springer-Verlag New York, Inc., 520 pp. 中文译本:气候物理学,1995,吴国雄、刘辉等译校,气象出版社。
*综合介绍:James, I., 1995: Introduction to circulating atmospheres. Cambridge University Press, 448 pp.
*理论部分:Vallis, G. K., 2006: Atmospheric and Oceanic Fluid Dynamics: Fundamentals and Large-scale Circulation. Cambridge University Press. 745 pp.
* [[其它参考书目(点击看详情)]]

Revision as of 14:27, 3 November 2020

  • 每道题目的解答都要有完整的解题过程。中英文不限。

Problem 1

Suppose we want to estimate the value of [math]\displaystyle{ Z }[/math]. Let [math]\displaystyle{ \mathcal{A} }[/math] be an algorithm that outputs [math]\displaystyle{ \widehat{Z} }[/math] satisfying [math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq \widehat{Z} \leq (1+\epsilon )Z] \geq \frac{3}{4} . }[/math]

We run [math]\displaystyle{ \mathcal{A} }[/math] independently for [math]\displaystyle{ s }[/math] times, and obtain the outputs [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math].

Let [math]\displaystyle{ X }[/math] be the median (中位数) of [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math]. Find the number [math]\displaystyle{ s }[/math] such that [math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq X \leq (1+\epsilon )Z] \geq 1 - \delta . }[/math]

Express [math]\displaystyle{ s }[/math] as a function of [math]\displaystyle{ \delta }[/math]. Make [math]\displaystyle{ s }[/math] as small as possible.

Remark: in this problem, we boost the probability of success from [math]\displaystyle{ \frac{3}{4} }[/math] to [math]\displaystyle{ 1-\delta }[/math]. This method is called the median trick.

Hint: Chernoff bound.

Problem 2

In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an [math]\displaystyle{ N }[/math] dimensional vector [math]\displaystyle{ x }[/math], we want to process a stream of arbitrary increments to entries in [math]\displaystyle{ x }[/math]. In other words, if we see a number [math]\displaystyle{ i\in 1,\dots,N }[/math] in the stream, we update entry [math]\displaystyle{ x_i\gets x_i + 1 }[/math]. Our goal is to estimate [math]\displaystyle{ \left \|x\right \|_0 }[/math], which measures the number of non-zero entries in [math]\displaystyle{ x }[/math]. With [math]\displaystyle{ x }[/math] viewed as a histogram that maintains counts for [math]\displaystyle{ N }[/math] potential elements, [math]\displaystyle{ \left \|x\right \|_0 }[/math] is exactly the number of distinct elements processed. In this problem we will develop an alternative algorithm for estimating [math]\displaystyle{ \left \|x\right \|_0 }[/math] that can also handle decrements to entries in x. Specifically, instead of the stream containing just indices [math]\displaystyle{ i }[/math], it contains pairs [math]\displaystyle{ (i, +) }[/math] and [math]\displaystyle{ (i, −) }[/math]. On receiving [math]\displaystyle{ (i, +) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i + 1 }[/math] and on receiving [math]\displaystyle{ (i, −) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i - 1 }[/math]. For this problem we will assume that, at the end of our stream, each [math]\displaystyle{ x_i \ge 0 }[/math] (i.e. for a specific index we can’t receive more decrements than increments).

  • Consider a simpler problem. For a given value [math]\displaystyle{ T }[/math], let’s design an algorithm that succeeds with probability [math]\displaystyle{ (1 − \delta) }[/math], outputing LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]:
    • Assume we have access to a completely random hash function [math]\displaystyle{ h(\cdot) }[/math] that maps each [math]\displaystyle{ i }[/math] to a random point in [math]\displaystyle{ [0, 1] }[/math]. We maintain the estimator [math]\displaystyle{ s=\sum_{i:h(i)\lt \frac{1}{2T}}x_i }[/math] as we receive increment and decrement updates. Show that, at the end of our stream,
      (i) If [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math], [math]\displaystyle{ Pr_h[s=0]\lt 1/e\approx 0.37 }[/math];
      (ii) If [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math], [math]\displaystyle{ Pr_h[s=0]\gt 0.5 }[/math].
    • Using this fact, show how to use [math]\displaystyle{ k=O(\log 1/\delta) }[/math] independent random hash functions, and corresponding individual estimators [math]\displaystyle{ s_1, s_2, . . . , s_k }[/math], to output LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]. If neither event occurs you can output either LOW or HIGH. Your algorithm should succeed with probability [math]\displaystyle{ (1 − \delta) }[/math].
  • Using [math]\displaystyle{ O(\log N) }[/math] repetitions of your algorithm for the above decision problem (with [math]\displaystyle{ \delta }[/math] set appropriately), show how to obtain an estimate [math]\displaystyle{ F }[/math] for [math]\displaystyle{ \left \|x\right \|_0 }[/math] such that [math]\displaystyle{ \frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0 }[/math] w.h.p.(with probability [math]\displaystyle{ 1-O(1/N) }[/math]).

Problem 3

Let [math]\displaystyle{ U }[/math] be a universal set. We use [math]\displaystyle{ 2^U }[/math] to denote the collection of all subsets of [math]\displaystyle{ U }[/math]. Let [math]\displaystyle{ \mathcal{F} }[/math] be a family of hash functions, in which each hash function [math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] maps subsets of [math]\displaystyle{ U }[/math] to 0-1 vectors of length [math]\displaystyle{ m }[/math]. A locality sensitive hashing scheme is a distribution on a family [math]\displaystyle{ \mathcal{F} }[/math] of hash functions operating on [math]\displaystyle{ 2^U }[/math], such that for two subsets [math]\displaystyle{ A,B \in 2^U }[/math],

[math]\displaystyle{ (1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B). }[/math]

Here [math]\displaystyle{ sim:2^U \times 2^U \rightarrow [0,1] }[/math] is called the similarity function. Given a hash function family [math]\displaystyle{ \mathcal{F} }[/math] that satisfies Equation (1), we will say that [math]\displaystyle{ \mathcal{F} }[/math] is a locality sensitive hash function family corresponding to similarity function [math]\displaystyle{ sim(\cdot,\cdot) }[/math].

  • For any similarity function [math]\displaystyle{ sim(A,B) }[/math] that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function [math]\displaystyle{ d(A,B) \triangleq 1-sim(A,B) }[/math] satisfies triangle inequality, formally,

[math]\displaystyle{ \forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C). }[/math]

  • Show that there is no locality sensitive hash function family corresponding to Dice's and the Overlap coefficient. Dice's coefficient is defined as:

[math]\displaystyle{ sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}. }[/math]

Overlap coefficient is defined as:

[math]\displaystyle{ sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}. }[/math]

Hint: use the triangle inequality result.

  • Construct a collection of hash function [math]\displaystyle{ \mathcal{B} }[/math] where [math]\displaystyle{ f : \{0,1\}^m \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ f \in \mathcal{B} }[/math], together with a probability distribution on [math]\displaystyle{ \mathcal{B} }[/math] such that

[math]\displaystyle{ \forall x, y \in \{0,1\}^m:\quad \Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases} 1 &\text{ if } x= y;\\ \frac{1}{2} &\text{ if } x \neq y. \end{cases} }[/math]

Then use the hash function family [math]\displaystyle{ \mathcal{B} }[/math] to prove the following result. Given a locality sensitive hash function family [math]\displaystyle{ \mathcal{F} }[/math] ([math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] for each [math]\displaystyle{ h \in \mathcal{F} }[/math]) corresponding to a similarity function [math]\displaystyle{ sim(A,B) }[/math], we can obtain a locality sensitive hash function [math]\displaystyle{ \mathcal{F}' }[/math] ([math]\displaystyle{ h':2^U \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ h' \in \mathcal{F}' }[/math]) corresponding to the similarity function [math]\displaystyle{ \frac{1+sim(A,B)}{2} }[/math].

Problem 4

Let [math]\displaystyle{ G(V,E) }[/math] be an undirected graph with positive edge weights [math]\displaystyle{ w:E\to\mathbb{Z}^+ }[/math]. Given a partition of [math]\displaystyle{ V }[/math] into [math]\displaystyle{ k }[/math] disjoint subsets [math]\displaystyle{ S_1,S_2,\ldots,S_k }[/math], we define

[math]\displaystyle{ w(S_1,S_2,\ldots,S_k)=\sum_{uv\in E\atop \exists i\neq j: u\in S_i,v\in S_j}w(uv) }[/math]

as the cost of the [math]\displaystyle{ k }[/math]-cut [math]\displaystyle{ \{S_1,S_2,\ldots,S_k\} }[/math]. Our goal is to find a [math]\displaystyle{ k }[/math]-cut with maximum cost.

  1. Give a poly-time greedy algorithm for finding the weighted max [math]\displaystyle{ k }[/math]-cut. Prove that the approximation ratio is [math]\displaystyle{ (1-1/k) }[/math].
  2. Consider the following local search algorithm for the weighted max cut (max 2-cut).
Fill in the blank parenthesis. Give an analysis of the running time of the algorithm. And prove that the approximation ratio is 0.5.
start with an arbitrary bipartition of [math]\displaystyle{ V }[/math] into disjoint [math]\displaystyle{ S_0,S_1 }[/math];

while (true) do

if [math]\displaystyle{ \exists i\in\{0,1\} }[/math] and [math]\displaystyle{ v\in S_i }[/math] such that (______________)
then [math]\displaystyle{ v }[/math] leaves [math]\displaystyle{ S_i }[/math] and joins [math]\displaystyle{ S_{1-i} }[/math];
continue;
end if
break;

end