OMB Lab's research interests include most aspects of computer security, with an emphasis on malware analysis, intrusion detection, and vulnerability analysis.



Our Technology

a) Static Analysis

        In static analysis, we focus on the file format and PE information of malware, such as PE head, Import table, packed information and so on. We also use a decompiler Radux which was made by our lab to find some useful instructions.

b) Dynamic Analysis

        In dynamic analysis, we run malware in an automatic sandbox. We analyze the behavior of every malware through the system call which were captured in sandbox.



Our Algorithm

We make our malware benchmark through a new-designed hierarchical clustering algorithm. Our algorithm can be separated into five steps:

        (1) Construct a graph which contains all nodes in the set. Every node in this graph implies one malware. There is one edge between every two nodes and the value of the edge implies the distance of this two malware. In the initial stages, every node is an independent cluster.

        (2) Input a value K as an expectation and M as a maximum threshold value. K implies the number of clusters we expect to make, and M implies the maximum distance between every two nodes in one cluster. It’s not allowed if a node want to join one cluster but there exists a distance, which is larger than M, of this point from another point in that cluster. It’s not allowed if there exists a distance, larger than M, of one point in cluster A from another point in cluster B, when merging two clusters.

        (3) Ordering all edges by values. It’s similar to Kruskal algorithm but we don’t aim at generate a tree but K Unicom trees.

        (4) Connect the edges ordered by their values. Make sure that every node has a distance which is shorter than M, to any other node in the same cluster. Repeated this step until there does not exist any edge can be allowed to connected.

        (5) After the loop of Step 4, there may be other two situations: the first one is that the number of the clusters generated is K when the algorithm is not completely accomplished. The second situation is that the number of the clusters is more than K. In Situation Two, we can set a maximum threshold value M larger in an acceptable range and continue the loop of Step 4.



EXAMPLE:

Given a set with 6 nodes A,B,C,D,E and F, all the values of edges are integer ranged from 0 to 10. The distance matrix is followed:


Figure 1 the graph of the nodes

Define the threshold value M as 4.

        First, order the edges which are shorter than 4. As the distance matrix shows, the order of the edges is EF(2), AD(3), BC(4), AB(4), BD(4). And in the initial stages, every node is one independent cluster.

        Second, connect the cluster with the edges by the order. When one connection is successful, the different clusters merge into one. As we can see in the graph(1), the edge EF fits to the requirement of the maximum threshold value, so the node E and F is merged into a new cluster. And now the diameter of the cluster {E,F} is 2. Then the next connection is edge AD. So as in graph(2), there are four clusters as {A,D}, {E,F}, B and C.

        In the similar way, we can finally get three clusters as {A,D}, {E,F} and {B,C} as in graph(3) and the forth edge AB is failed to connected because the diameter of {A,B,C,D} is 5, which is larger than 4 as in graph(4).

        If we set the K as 3 in the initial stages, the algorithm will shutdown right now. But if setting the K as 2, we can change M, the maximum threshold value , to 5 if it is acceptable.