imported>Etone |
imported>Etone |
Line 1: |
Line 1: |
| = Parameter Estimation = | | {{Infobox |
| Consider the following abstract problem of parameter estimation.
| | |name = Infobox |
| | |bodystyle = |
| | |title = <font size=3>高级算法 |
| | <br>Advanced Algorithms</font> |
| | |titlestyle = |
|
| |
|
| Let <math>U</math> be a finite set of known size, and let <math>G\subseteq U</math>. We want to estimate the ''parameter'' <math>|G|</math>, i.e. the size of <math>G</math>, or equivalently, the density (or frequency) <math>\frac{|G|}{|U|}</math>.
| | |image = |
| | |imagestyle = |
| | |caption = |
| | |captionstyle = |
| | |headerstyle = background:#ccf; |
| | |labelstyle = background:#ddf; |
| | |datastyle = |
|
| |
|
| We assume two devices:
| | |header1 =Instructor |
| * A '''uniform sampler''' <math>\mathcal{U}</math>, which uniformly and independently samples a member of <math>U</math> upon each calling.
| | |label1 = |
| * A '''membership oracle''' of <math>G</math>, denoted <math>\mathcal{O}</math>. Given as the input an <math>x\in U</math>, <math>\mathcal{O}(x)</math> indicates whether or not <math>x</math> is a member of <math>G</math>.
| | |data1 = |
| | | |header2 = |
| Equipped by <math>\mathcal{U}</math> and <math>\mathcal{O}</math>, we can have the following '''Monte Carlo method''':
| | |label2 = |
| *Choose <math>N</math> independent samples from <math>U</math> by the uniform sampler <math>\mathcal{U}</math>, represented by the random variables <math>X_1,X_2,\ldots, X_N</math>.
| | |data2 = 尹一通<br>郑朝栋 |
| * Let <math>Y_i</math> be the indicator random variable defined as <math>Y_i=\mathcal{O}(X_i)</math>, namely, <math>Y_i</math> indicates whether <math>X_i\in G</math>.
| | |header3 = |
| * Define the estimator random variable
| | |label3 = Email |
| ::<math>Z=\frac{|U|}{N}\sum_{i=1}^N Y_i.</math>
| | |data3 = yinyt@nju.edu.cn chaodong@nju.edu.cn |
| | | |header4 = |
| It is easy to see that <math>\mathbf{E}[Z]=|G|</math> and we might hope that with high probability the value of <math>Z</math> is close to <math>|G|</math>. Formally, <math>Z</math> is called an '''<math>\epsilon</math>-approximation''' of <math>|G|</math> if
| | |label4= office |
| :<math>
| | |data4= 计算机系 804 |
| (1-\epsilon)|G|\le Z\le (1+\epsilon)|G|.
| | |header5 = Class |
| </math>
| | |label5 = |
| | | |data5 = |
| The following theorem states that the probabilistic accuracy of the estimation depends on the number of samples and the ratio between <math>|G|</math> and <math>|U|</math>
| | |header6 = |
| | | |label6 = Class meetings |
| {{Theorem
| | |data6 = Wednesday, 8am-10am <br> 仙I-319 |
| |Theorem (estimator theorem)| | | |header7 = |
| :Let <math>\alpha=\frac{|G|}{|U|}</math>. Then the Monte Carlo method yields an <math>\epsilon</math>-approximation to <math>|G|</math> with probability at least <math>1-\delta</math> provided
| | |label7 = Place |
| ::<math>N\ge\frac{4}{\epsilon^2 \alpha}\ln\frac{2}{\delta}</math>.
| | |data7 = |
| | |header8 = |
| | |label8 = Office hours |
| | |data8 = Wednesday, 10am-12pm <br>计算机系 804(尹一通)、302(郑朝栋) |
| | |header9 = Textbooks |
| | |label9 = |
| | |data9 = |
| | |header10 = |
| | |label10 = |
| | |data10 = [[File:MR-randomized-algorithms.png|border|100px]] |
| | |header11 = |
| | |label11 = |
| | |data11 = Motwani and Raghavan. <br>''Randomized Algorithms''.<br> Cambridge Univ Press, 1995. |
| | |header12 = |
| | |label12 = |
| | |data12 = [[File:Approximation_Algorithms.jpg|border|100px]] |
| | |header13 = |
| | |label13 = |
| | |data13 = Vazirani. <br>''Approximation Algorithms''. <br> Springer-Verlag, 2001. |
| | |belowstyle = background:#ddf; |
| | |below = |
| }} | | }} |
| {{Proof|
| |
| Recall that <math>Y_i</math> indicates whether the <math>i</math>-th sample is in <math>G</math>. Let <math>Y=\sum_{i=1}^NY_i</math>. Then we have <math>Z=\frac{|U|}{N}Y</math>, and hence the event <math>(1-\epsilon)|G|\le Z\le (1+\epsilon)|G|</math> is equivalent to that <math>(1-\epsilon)\frac{|G|}{|U|}N\le Y\le (1+\epsilon)\frac{|G|}{|U|}N</math>. Note that each <math>Y_i</math> is a Bernoulli trial that succeeds with probability <math>\frac{|G|}{|U|}</math>, thus <math>\mathbb{E}[Y]=\frac{|G|}{|U|}N</math>. Then the rest is due to Chernoff bound.}}
| |
|
| |
| A counting algorithm for the set <math>G</math> has to deal with the following three issues:
| |
| # Implement the membership oracle <math>\mathcal{O}</math>. This is usually straightforward, or assumed by the model.
| |
| # Implement the uniform sampler <math>\mathcal{U}</math>. This can be straightforward or highly nontrivial, depending on the problem.
| |
| # Deal with exponentially small <math>\alpha=\frac{|G|}{|U|}</math>. This requires us to cleverly choose the universe <math>U</math>. Sometimes this needs some nontrivial ideas.
| |
|
| |
| = Counting DNFs =
| |
| A disjunctive normal form (DNF) formular is a disjunction (OR) of clauses, where each clause is a conjunction (AND) of literals. For example:
| |
| :<math>(x_1\wedge \overline{x_2}\wedge x_3)\vee(x_2\wedge x_4)\vee(\overline{x_1}\wedge x_3\wedge x_4)</math>.
| |
| Note the difference from the conjunctive normal forms (CNF).
| |
|
| |
| Given a DNF formular <math>\phi</math> as the input, the problem is to count the number of satisfying assignments of <math>\phi</math>. This problem is [http://en.wikipedia.org/wiki/Sharp-P-complete '''#P-complete'''].
| |
|
| |
| Naively applying the Monte Carlo method will not give a good answer. Suppose that there are <math>n</math> variables. Let <math>U=\{\mathrm{true},\mathrm{false}\}^n</math> be the set of all truth assignments of the <math>n</math> variables. Let <math>G=\{x\in U\mid \phi(x)=\mathrm{true}\}</math> be the set of satisfying assignments for <math>\phi</math>. The straightforward use of Monte Carlo method samples <math>N</math> assignments from <math>U</math> and check how many of them satisfy <math>\phi</math>. This algorithm fails when <math>|G|/|U|</math> is exponentially small, namely, when exponentially small fraction of the assignments satisfy the input DNF formula.
| |
|
| |
|
| |
| ;The union of sets problem
| |
| We reformulate the DNF counting problem in a more abstract framework, called the '''union of sets''' problem.
| |
|
| |
| Let <math>V</math> be a finite universe. We are given <math>m</math> subsets <math>H_1,H_2,\ldots,H_m\subseteq V</math>. The following assumptions hold:
| |
| *For all <math>i</math>, <math>|H_i|</math> is computable in poly-time.
| |
| *It is possible to sample uniformly from each individual <math>H_i</math>.
| |
| *For any <math>x\in V</math>, it can be determined in poly-time whether <math>x\in H_i</math>.
| |
|
| |
|
| The goal is to compute the size of <math>H=\bigcup_{i=1}^m H_i</math>.
| | This is the webpage for the ''Advanced Algorithms'' class of fall 2018. Students who take this class should check this page periodically for content updates and new announcements. |
|
| |
|
| DNF counting can be interpreted in this general framework as follows. Suppose that the DNF formula <math>\phi</math> is defined on <math>n</math> variables, and <math>\phi</math> contains <math>m</math> clauses <math>C_1,C_2,\ldots,C_m</math>, where clause <math>C_i</math> has <math>k_i</math> literals. Without loss of generality, we assume that in each clause, each variable appears at most once.
| | = Announcement = |
| * <math>V</math> is the set of all assignments.
| | * (2018/9/5) 新学期第一次上课。 |
| *Each <math>H_i</math> is the set of satisfying assignments for the <math>i</math>-th clause <math>C_i</math> of the DNF formular <math>\phi</math>. Then the union of sets <math>H=\bigcup_i H_i</math> gives the set of satisfying assignments for <math>\phi</math>.
| |
| * Each clause <math>C_i</math> is a conjunction (AND) of literals. It is not hard to see that <math>|H_i|=2^{n-k_i}</math>, which is efficiently computable. | |
| * Sampling from an <math>H_i</math> is simple: we just fix the assignments of the <math>k_i</math> literals of that clause, and sample uniformly and independently the rest <math>(n-k_i)</math> variable assignments.
| |
| * For each assignment <math>x</math>, it is easy to check whether it satisfies a clause <math>C_i</math>, thus it is easy to determine whether <math>x\in H_i</math>.
| |
|
| |
|
| ==The coverage algorithm== | | = Course info = |
| We now introduce the coverage algorithm for the union of sets problem.
| | * '''Instructor ''': 尹一通、郑朝栋 |
| | :*email: yinyt@nju.edu.cn, chaodong@nju.edu.cn |
| | * '''Class meeting''': Wednesday 8am-10am, 仙I-319. |
| | * '''Office hour''': Wednesday 10am-12pm, 计算机系 804. |
|
| |
|
| Consider the multiset <math>U</math> defined by
| | = Syllabus = |
| :<math>U=H_1\uplus H_2\uplus\cdots \uplus H_m</math>,
| |
| where <math>\uplus</math> denotes the multiset union. It is more convenient to define <math>U</math> as the set
| |
| :<math>U=\{(x,i)\mid x\in H_i\}</math>.
| |
| For each <math>x\in H</math>, there may be more than one instances of <math>(x,i)\in U</math>. We can choose a unique representative among the multiple instances <math>(x,i)\in U</math> for the same <math>x\in H</math>, by choosing the <math>(x,i)</math> with the minimum <math>i</math>, and form a set <math>G</math>.
| |
|
| |
|
| Formally, <math>G=\{(x,i)\in U\mid \forall (x,j)\in U, j\le i\}</math>. Every <math>x\in H</math> corresponds to a unique <math>(x,i)\in G</math> where <math>i</math> is the smallest among <math>x\in H_i</math>.
| | === 先修课程 Prerequisites === |
| | * 必须:离散数学,概率论,线性代数。 |
| | * 推荐:算法设计与分析。 |
|
| |
|
| It is obvious that <math>G\subseteq U</math> and
| | === Course materials === |
| :<math>|G|=|H|</math>.
| | * [[高级算法 (Fall 2018) / Course materials|<font size=3>教材和参考书</font>]] |
|
| |
|
| Therefore, estimation of <math>|H|</math> is reduced to estimation of <math>|G|</math> with <math>G\subseteq U</math>. Then <math>|G|</math> can have an <math>\epsilon</math>-approximation with probability <math>(1-\delta)</math> in poly-time, if we can uniformly sample from <math>U</math> and <math>|G|/|U|</math> is suitably small.
| | === 成绩 Grades === |
| | * 课程成绩:本课程将会有若干次作业和一次期末考试。最终成绩将由平时作业成绩和期末考试成绩综合得出。 |
| | * 迟交:如果有特殊的理由,无法按时完成作业,请提前联系授课老师,给出正当理由。否则迟交的作业将不被接受。 |
|
| |
|
| An uniform sample from <math>U</math> can be implemented as follows:
| | === <font color=red> 学术诚信 Academic Integrity </font>=== |
| * generate an <math>i\in\{1,2,\ldots,m\}</math> with probability <math>\frac{|H_i|}{\sum_{i=1}^m|H_i|}</math>;
| | 学术诚信是所有从事学术活动的学生和学者最基本的职业道德底线,本课程将不遗余力的维护学术诚信规范,违反这一底线的行为将不会被容忍。 |
| * uniformly sample an <math>x\in H_i</math>, and return <math>(x,i)</math>.
| |
|
| |
|
| It is easy to see that this gives a uniform member of <math>U</math>. The above sampling procedure is poly-time because each <math>|H_i|</math> can be computed in poly-time, and sampling uniformly from each <math>H_i</math> is poly-time.
| | 作业完成的原则:署你名字的工作必须由你完成。允许讨论,但作业必须独立完成,并在作业中列出所有参与讨论的人。不允许其他任何形式的合作——尤其是与已经完成作业的同学“讨论”。 |
|
| |
|
| We now only need to lower bound the ratio
| | 本课程将对剽窃行为采取零容忍的态度。在完成作业过程中,对他人工作(出版物、互联网资料、其他人的作业等)直接的文本抄袭和对关键思想、关键元素的抄袭,按照 [http://www.acm.org/publications/policies/plagiarism_policy ACM Policy on Plagiarism]的解释,都将视为剽窃。剽窃者成绩将被取消。如果发现互相抄袭行为,<font color=red> 抄袭和被抄袭双方的成绩都将被取消</font>。因此请主动防止自己的作业被他人抄袭。 |
| :<math>\alpha=\frac{|G|}{|U|}</math>. | |
|
| |
|
| We claim that
| | 学术诚信影响学生个人的品行,也关乎整个教育系统的正常运转。为了一点分数而做出学术不端的行为,不仅使自己沦为一个欺骗者,也使他人的诚实努力失去意义。让我们一起努力维护一个诚信的环境。 |
| :<math>\alpha\ge\frac{1}{m}</math>.
| |
| It is easy to see this, because each <math>x\in H</math> has at most <math>m</math> instances of <math>(x,i)</math> in <math>U</math>, and we already know that <math>|G|=|H|</math>.
| |
|
| |
|
| Due to the estimator theorem, this needs <math>\frac{4m}{\epsilon^2}\ln\frac{2}{\delta}</math> uniform random samples from <math>U</math>.
| | = Assignments = |
| | * TBA |
|
| |
|
| This gives the coverage algorithm for the abstract problem of the union of sets. The DNF counting is a special case of it.
| | = Lecture Notes = |
| | # [[高级算法 (Fall 2018)/Min-Cut and Max-Cut|Min-Cut and Max-Cut]] |
| | #: [[高级算法 (Fall 2018)/Probability Basics|Probability basics]] |