## Research projects during my undergraduate study

An Example of Similar Group

Compute Treewidth

Treewidth is an important notion in many fields nowadays, including Constraint Satisfaction, Probabilistic Inference, etc. Since the complexity of these algorithms is usually exponential in the treewidth of the graph, a smaller treewidth will help improve the efficiency of these algorithms. Moreover, many in general NP-complete problems can be solved in polynomial or even linear time, when the treewidth of the input graph is bounded. So treewidth is important from both practical and theoretical perspectives. I proposed a new algorithm called FPBB for computing treewidth, and has solved 17 benchmark graphs whose exact treewidth were previously unknown in TreewidthLIB. FPBB is a parallel branch and bound algorithm, and utilizes a new heuristic called similar group. FPBB is significantly faster than the state-of-the-art algorithm proposed by Zhou and Hansen. The executable files and source code of FPBB can be found here.

Special thanks:

This work was originated from an undergraduate research program supervised by Prof Tian Liu. Thanks are due to Prof Tian Liu and Prof Ke Xu for introducing me to the notion of treewidth and for suggestions on improving treewidth algorithms. Thanks are also due to my classmate Chaoyi Wang and other students for helpful discussions on related topics.

Treewidth is an important notion in many fields nowadays, including Constraint Satisfaction, Probabilistic Inference, etc. Since the complexity of these algorithms is usually exponential in the treewidth of the graph, a smaller treewidth will help improve the efficiency of these algorithms. Moreover, many in general NP-complete problems can be solved in polynomial or even linear time, when the treewidth of the input graph is bounded. So treewidth is important from both practical and theoretical perspectives. I proposed a new algorithm called FPBB for computing treewidth, and has solved 17 benchmark graphs whose exact treewidth were previously unknown in TreewidthLIB. FPBB is a parallel branch and bound algorithm, and utilizes a new heuristic called similar group. FPBB is significantly faster than the state-of-the-art algorithm proposed by Zhou and Hansen. The executable files and source code of FPBB can be found here.

Special thanks:

This work was originated from an undergraduate research program supervised by Prof Tian Liu. Thanks are due to Prof Tian Liu and Prof Ke Xu for introducing me to the notion of treewidth and for suggestions on improving treewidth algorithms. Thanks are also due to my classmate Chaoyi Wang and other students for helpful discussions on related topics.

## Code Clone Detection

In software development, it is common to reuse some code fragments by copying with or without minor modifications. This kind of code fragments are called code clones. It was shown that code clones are harmful in software maintenance and evolution. There are mainly four types of code clone detection techniques: textual, lexical, syntactic and semantic. I designed a new algorithm based on counting called CMCD, which uses Count Matrix to represent a code block. By comparing each Count Matrix, CMCD can find code clones. CMCD has good performance on the scenario-based benchmarks, and can be used for aspect mining or bug detection. Using CMCD, I found many aspects and a bug in JDK source code. Besides, CMCD can be used for detecting code plagiarism. Two courses in Peking University have adopted CMCD to check students' homework. I'm still working on this project, and currently only JAVA version of CMCD is available.

Based on the idea of Count Matrix, I designed a new token based algorithm called Boreas, which further introduced the notions of Count Vectors of keywords and symbols. Boreas works on the block level granularity, and is much faster than the state-of-the-art algorithm Deckard. It is able to find meaningful large clone clusters. Some results can be found here.

## Compute Hinge Width

Hinge width is a parameter of structural CSP decomposition, and was introduced in Database Theory. It is incomparable with treewidth, and is efficiently computable. I conducted experiments on instances of realistic size for different kinds of random graphs to verify that hinge widths of these random graphs are large when the size of the graphs grows (the Experiments Section of the paper Large Hinge Width on Sparse Random Hypergraphs, IJCAI 2011). For the source code, see here. I also wrote the code for generating G(n,p), G(n,M) and Power Law graphs, which is included in the source code.

## Stereo Matching

Stereo Matching is an important problem in Computer Vision. The goal of this problem is to determine disparities of two or more pictures. Generally, it is an ill-posed problem, because many problems, such as the existence of occluded regions, textureless regions and sensor noise, prevent recovering an accurate disparity map. A popular kind of algorithms for Stereo Matching is using Mean-Shift approach to do color segmentation first. We (with Tianxiang Zhang) designed a new algorithm following this idea, and used special triple iteration method to do the plane assignment step. Our program runs comparatively fast, and ranked #93 in Middlebury Stereo Vision Platform. Moreover, we did little parameter adjustment based on the benchmark, so the results are nature and good. For the source code, see here.

## Data Compression

I'm interested in data compression algorithms. I have implemented the LZW algorithm, see here. I also designed a new algorithm based on DEFLATE & LZW algorithms, and has slightly better performance on the Large Text Compression Benchmark than LZW algorithm (but significantly slower), see here.