15-451 11/11/03 * NP-completeness recap * Approximation Algorithms Reminder: quiz on Thurs. Up to and including NP-completeness ====================================================================== NP-completeness recap: Going back to network flow and linear programming, one thing that makes these problems important is that you can use them to represent a lot of problems (i.e., you can reduce a lot of problems to them). One way to think about NP-completeness is that a problem like 3-SAT is so expressive, it can represent *any* problem in NP. I.e., you can reduce any problem in NP to that. One way to show some other problem X is NP-complete by showing how a magic algorithm to solve X could be used to solve 3-SAT (which then can solve anything in NP). A good way to think of this is to imagine someone has an instance of 3-SAT that they will pay $1,000,000 to solve, and someone else has a magic box for X they're willing to sell for $100,000. So, what you want to do is show how you could convert the given 3-SAT instance to an instance of problem X, run the magic algorithm, and have this give you a solution to the original million-dollar 3-SAT instance. Any more questions about NP-completeness? Today: look at approximation algorihms. General Question: Maybe we can't hope to always get the best solution, but can we at least guarantee to get a "pretty good" solution? E.g., can we guarantee to find a solution that's within 10% of optimal? Or how about within a factor of 2 of optimal? Or, anything non-trivial? Interesting thing: even though all NP-complete problems are equivalent in terms of difficulty of finding optimal solution, the hardness or easyness of getting a good approximation varies all over the map. VERTEX COVER ============ - Given a graph G, looking for smallest set of vertices such that every edge is incident to (touches) at least one vertex in the set. - Example: +----+----+ | | | +----+----+ - Can think of as: what is the fewest # of guards we need to place in museum to cover all the corridors. - This problem is NP-hard for general graphs. (You did this on the quiz for the case G is a *tree*). But it turns out we can get within a factor of 2. Let's start with some strategies that *don't* work. Strawman #1: Pick arbitrary vertex with at least one uncovered edge incident, put into cover, and repeat. What would be a bad example? [A: how about a star graph] Strawman #2: how about picking the vertex that covers the *most* uncovered edges. Turns out this doesn't work either. [make bipartite graph where opt is size t, but this alg finds one of size t + floor(t/2) + floor(t/3) + floor(t/4) + ... + 1. This is Theta(t log t). Best examples are with t=6 or t=12.] How to get factor of 2. Two algorithms: Alg1: Pick arbitrary edge. We know any VC must have at least 1 endpt, so lets take both. Then throw out all edges covered and repeat. Keep going until no uncovered edges left. What we've found in the end is a matching (a set of edges no two of which share an endpoint) that's "maximal" (meaning that you can't add any more edges to it and keep it a matching). This means if we take all endpoints, we have a VC. So, if we picked k edges, our VC has size 2k, but any VC must have size at least k since you need to have at least one endpoint of each of these edges. Here's another 2-approximation algorithm for Vertex Cover: Alg2: Step1: Solve a *fractional* version of the problem. Have a variable x_i for each vertex. Constraint 0<= x_i <= 1. Each edge should be "fractionally covered": for each edge (i,j) we want x_i+x_j >= 1. Then our goal is to minimize sum_i x_i. We can solve this using linear programming. E.g., triangle-graph. E.g., star-graph. Step2: now pick each vertex i such that x_i >= 1/2. Claim 1: this is a VC. Why? Claim 2: The size of this VC is at most twice the size of the optimal VC. Why? Explanation in two parts: First, notice that it's at most twice the value of the *fractional* solution we found (because it's no worse than doubling and then rounding down). Then, notice that the size of the optimal fractional solution is <= the size of the optimal integral solution, i.e., the optimal VC, because the optimal VC *is* a legal fractional solution. Interesting fact: nobody knows any algorithm with approximation ratio 1.9. Best known is 2 - O((loglog n)/(log n)), which is 2 - o(1). Current best hardness result: Hastad shows 7/6 is NP-hard. Recently improved to 1.361 by Dinur and Safra. Beating 2-epsilon has been related to some other open problems, but not known to be NP-hard. SET-COVER --------- On hwk6, problem 1 talks about the set-cover problem: Given a domain X of n points, and m subsets S_1, S_2, ..., S_m of these points. Goal: find the fewest number of these subsets needed to cover all the points. As you will be showing on hwk6, set-cover is NP-hard. However, there's a simple algorithm that gets a ratio of ln(n): Greedy Algorithm: Pick the set that covers the most points. Throw out all the points covered. Repeat. What's an example where this *doesn't* find the best solution? Theorem: If the optimal solution uses k sets, the greedy algorithm finds a solution with at most k*ln(n) sets. Proof: Since the optimal solution uses k sets, there must some set that covers at least a 1/k fraction of the points. Therefore, after the first iteration of the algorithm, there are at most n(1 - 1/k) points left. After the second, there are at most n(1 - 1/k)^2 points left, etc. After k rounds, there are at most n(1 - 1/k)^k < n*(1/e) points left. After k*ln(n) rounds, there are < n*(1/e)^{ln n} = 1 points left, which means we must be done. In fact, it's been proven that unless everything in NP can be solved in time n^{O(loglog n)}, then you can't get better than (1-epsilon)*ln(n) [Feige]. MAX-SAT: Given a CNF formula (like in SAT), try to maximize the number of clauses satisfied. To make things cleaner, let's assume we have reduced each clause [so, (x or x or y) would become just (x or y), and (x or not(x)) would be removed] Let m = # of clauses in the formula, m_1 = # clauses of size 1, m_2 = # clauses of size 2, etc. So, m = m_1 + m_2 + ... Claim: There exists an assignment that satisfies at least sum_k m_k(1 - 1/2^k) clauses. Furthermore, there is an efficient algorithm to find one. Corollary: if every clause has size 3 (this is sometimes called the MAX-exactly-3-SAT problem), then we can satisfy at least m(7/8) clauses. So, this is for sure a 7/8-approximation, since the optimal solution satisfies at most m clauses. Interesting fact: getting a 7/8 + epsilon approximation for any constant epsilon (like .001) is NP-hard. Proof of claim: Consider a random assignment to the variables. For a clause of size k, the chance it is *not* satisfied is 1/2^k. So, the expected # of clauses satisfied is sum_k m_k(1 - 1/2^k). Since a random assignment satisfies this many in expectation, surely an assignment that does this well must exist. Also, if we just keep trying random assignments, and it's not hard to show that whp, pretty soon we'll find one that's as least as good as this expectation. How about a deterministic algorithm? Here's what we can do. Look at x_1: for each of the two possible settings (0 or 1) we can calculate the expected number of clauses satisfied if we were to go with that setting, and then set the rest of the variables randomly. (It is just the number of clauses already satisfied plus sum_k m_k(1-1/2^k), where m_k is the number of clauses of size k in the ``formula to go''.) Fix x_1 to the setting that gives us a larger expectation. Now go on to x_2 and do the same thing, setting it to the value with the highest expectation-to-go, and then x_3 and so on. The point is that since we always pick the setting whose expectation-to-go is larger, this expectation-to-go never decreases (since our current expectation is the average of the ones we get by setting the next variable to 0 or 1). This is called the ``conditional expectation'' method. The algorithm itself is completely deterministic --- in fact we could rewrite it to get rid of any hint of randomization by just viewing sum_k m_k(1-1/2^k) as a way of weighting the clauses to favor the small ones, but our motivation was based on the randomized method. In general, the area of approximation algorithms and approximation hardness is a major area of algorithms research. Occupies a good fraction of major algorithms conferences.