A Local Overlapping Community Detection Method in Complex Networks

Community structure detection is of great importance for finding the relationships among elements in complex networks. This paper presents a local method that simultaneously takes into account the weak community structure definition and community subgraph density, using a greedy strategy for community expansion. The results are compared with several previous methods on artificial and real-world networks. Experimental results verify the feasibility and effectiveness of our approach.


Introduction
In recent years, community structure detection has become one of the most important research fields in complex network analysis. In real-world networks, there are always some overlapping nodes that belong to more than one community. Researchers have therefore proposed many approaches for finding overlapping communities. For large-scale complex networks, we usually do not know the ground truth of the communities. In addition, detecting communities from a global view may lead to high complexity. Local methods for community detection have thus become an important idea. Papers [1][2][3][4][5] present methods for discovering communities from a local view. However, some of these methods require parameters whose suitable values are hard to predict, and some cannot find the outliers in networks even when they give good results. Our paper presents a new local metric that combines the weak community structure definition with community subgraph density. The method can effectively exclude outliers in networks.

Local Community Detection
The basic idea of local community detection is first to select a community center as the current community, and then to add nodes to or remove nodes from the community according to a certain strategy.
The Selection of Community Center. The main methods of selecting centers are as follows.
Random Node [1]. Take a random node from the network graph that has not yet been assigned to any community as a center. However, if a different node is selected each time, the results are likely to differ, so the method is unstable.
Rank Removal (RaRe) [2]. Nodes are first ranked by some measure of importance, for example, PageRank. Highly ranked nodes are then removed in groups until small connected components are formed.
Link Aggregate (LA) [3]. Nodes are first ranked, typically using PageRank. Then a node is added to any cluster if adding it improves the value of that cluster's metric. If the node is not added to any cluster, it creates a new cluster.
Clique Percolation [4]. The results of the CPM algorithm are used as the centers.
Maximal Cliques [5]. First, find all maximal cliques in the network graph. Then use the clique coverage heuristic to remove cliques with high coverage. That is, the centers are the set of maximal cliques that are not near-duplicates.
In many complex networks, especially social networks, the elements at the center of a community tend to form clique structures. Lee et al. [6] have demonstrated that maximal cliques give higher efficiency and accuracy than other methods. Our method therefore also uses maximal cliques as community centers.
The Selection of Local Metric. The two typical local metrics are as follows.
A characteristic of a community in a network is that it has more internal edges than external edges. Paper [2] used the ratio of the internal edges to the total edges of a community as the local metric,
$$\frac{E_{in}}{E_{in} + E_{out}} \qquad (1)$$
where $E_{in}$ and $E_{out}$ denote the numbers of internal and external edges of a community, respectively.
According to the weak community structure definition, the internal degree of a community is greater than its external degree. Papers [1,5] used the ratio of the internal degree to the total degree of a community as the local metric,
$$\frac{K_{in}}{K_{in} + K_{out}} \qquad (2)$$
where $K_{in}$ and $K_{out}$ denote the internal and external degree of a community, respectively.
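The two ratios can be illustrated with a minimal Python sketch. It assumes an undirected graph stored as a dictionary of adjacency sets, and it assumes that the internal degree counts every internal edge twice (once per endpoint) while the external degree counts every boundary edge once. The function names and the toy graph are hypothetical and are not taken from the cited papers.

```python
from typing import Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def edge_counts(graph: Graph, community: Set[Hashable]) -> Tuple[int, int]:
    """Return (E_in, E_out): edges inside the community and edges leaving it."""
    endpoints_inside = 0  # every internal edge is seen from both endpoints
    e_out = 0
    for u in community:
        for v in graph[u]:
            if v in community:
                endpoints_inside += 1
            else:
                e_out += 1
    return endpoints_inside // 2, e_out


def edge_ratio(graph: Graph, community: Set[Hashable]) -> float:
    """Eq. 1: E_in / (E_in + E_out)."""
    e_in, e_out = edge_counts(graph, community)
    total = e_in + e_out
    return e_in / total if total else 0.0


def degree_ratio(graph: Graph, community: Set[Hashable]) -> float:
    """Eq. 2: K_in / (K_in + K_out), assuming K_in = 2 * E_in and K_out = E_out."""
    e_in, e_out = edge_counts(graph, community)
    k_in, k_out = 2 * e_in, e_out
    total = k_in + k_out
    return k_in / total if total else 0.0


# Hypothetical toy graph: a triangle a-b-c with one extra edge c-d.
adj: Graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
community = {"a", "b", "c"}
print(edge_ratio(adj, community))    # 3 / (3 + 1)
print(degree_ratio(adj, community))  # 6 / (6 + 1)
```

Under these assumptions the degree-based ratio gives internal edges twice the weight of boundary edges, so the two metrics can rank the same candidate node differently.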
Of these metrics, Eq. 2 has better characteristics. However, its results usually include many outliers. Although the authors added a parameter α to adjust the size of communities, it was not intended to exclude outliers. Fig.1 illustrates this problem. Consider a local community C and nodes $n_1, \ldots, n_6$, which may be outliers since they are barely connected to C. Let A denote the set of nodes adjacent to C, and assume that, among all nodes in A, $n_1$ maximizes Eq. 3; then $n_1$ to $n_4$ will be added to C one by one. The reason is that each such addition does not change the external degree but increases the internal degree by two. However, these additions would be inaccurate, since nodes on the chain, especially those toward its far end, are distant and can be presumed to be outliers. It would therefore be better if the local metric could handle sparse chains of nodes in such networks. We propose a new local metric that solves this problem.
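The chain effect can be checked numerically with a small sketch. The graph below is a hypothetical stand-in for Fig.1, not a reproduction of it: a 4-clique core with two other external neighbors (x and y) and a chain of four nodes hanging off one core node. The ratio is computed as internal degree over total degree, assuming each internal edge contributes two to the internal degree.

```python
def build_graph(edges):
    """Build an undirected adjacency-set dictionary from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_ratio(adj, community):
    """K_in / (K_in + K_out) for a node set (assumed form of Eq. 2)."""
    k_in = k_out = 0
    for u in community:
        for v in adj[u]:
            if v in community:
                k_in += 1   # each internal edge is counted from both endpoints
            else:
                k_out += 1  # each boundary edge is counted once
    return k_in / (k_in + k_out)


core = ["c1", "c2", "c3", "c4"]
edges = [(a, b) for i, a in enumerate(core) for b in core[i + 1:]]  # 4-clique
edges += [("c1", "x"), ("c1", "y")]                                 # other neighbors
edges += [("c4", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n4")]   # outlier chain
adj = build_graph(edges)

community = set(core)
scores = [degree_ratio(adj, community)]  # 12/15 before any chain node is added
for node in ["n1", "n2", "n3", "n4"]:
    community.add(node)
    scores.append(degree_ratio(adj, community))

# 12/15, 14/17, 16/19, 18/21, 20/22: the ratio rises with every chain node.
print(scores)
```

In this toy case the ratio never decreases while the chain is absorbed, so a greedy expansion driven by this ratio alone has no reason to stop before the end of the chain.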

Fig.1. A local community with outlier chains
Our function simultaneously takes into account the weak community structure definition and community subgraph density. In a subgraph, the more links there are, the greater the density. We therefore define the community subgraph density as the ratio of the number of internal edges of the subgraph to the number of edges of the corresponding complete graph on the same nodes,
$$\frac{2E_{in}}{n(n-1)}$$
where $n$ is the number of nodes in the subgraph. In the resulting metric, λ (0≤λ≤1) is a parameter that allows the results to be fine-tuned. Setting λ=0 produces the same results as Eq. 2, while larger values make the community density more significant. Larger values of λ also produce smaller groups, which allows communities to be produced across a variety of resolutions. In practice, a value of λ near 0.1 works better: it excludes chains of outliers effectively while avoiding an overly tight constraint on community formation.
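The density term alone can be sketched as follows. The sketch implements only the verbal definition above (internal edges divided by the edges of the complete graph on the same nodes); how the density is combined with the degree ratio through λ is not reproduced here. The graph is a hypothetical 4-clique with a chain of four nodes attached, and all names are illustrative.

```python
def build_graph(edges):
    """Build an undirected adjacency-set dictionary from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def subgraph_density(adj, community):
    """Internal edges divided by the edge count of the complete graph on n nodes."""
    n = len(community)
    if n < 2:
        return 0.0  # density is undefined for fewer than two nodes; use 0 here
    # Each internal edge is seen from both endpoints, hence the division by 2.
    e_in = sum(1 for u in community for v in adj[u] if v in community) // 2
    return 2 * e_in / (n * (n - 1))


core = ["c1", "c2", "c3", "c4"]
edges = [(a, b) for i, a in enumerate(core) for b in core[i + 1:]]  # 4-clique
edges += [("c4", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n4")]   # outlier chain
adj = build_graph(edges)

community = set(core)
densities = [subgraph_density(adj, community)]  # a clique has density 1.0
for node in ["n1", "n2", "n3", "n4"]:
    community.add(node)
    densities.append(subgraph_density(adj, community))

# 12/12, 14/20, 16/30, 18/42, 20/56: every chain node lowers the density.
print(densities)
```

Each chain node adds one internal edge but many missing pairs, so the density drops sharply. This is the signal that a density-aware metric can use to reject sparse chains.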

Information Technology for Manufacturing Systems III

Local Community Detection Method based on a Metric Combining Weak Community Structure and Community Density: WDLCD
The specific steps are as follows.
(1) Find all maximal cliques with at least k nodes (k=3 or 4) in the network graph G. We preprocess these cliques as in paper [5]. First, we order the maximal cliques, largest first. Then a clique is discarded if at least a fraction Ф of its nodes have already been covered twice by other larger, accepted cliques. Ф is called the clique coverage heuristic, where C' denotes the clique in question and |C'| the number of nodes in it, $C_i$ denotes an accepted clique, and |C'∩$C_i$| the number of nodes shared by the two cliques. We also choose a value of 0.75 for Ф.
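Step (1) can be sketched in Python as follows. This is an illustrative reading of the description above and not the implementation of [5]: maximal cliques are enumerated with the Bron-Kerbosch algorithm with pivoting (a swapped-in standard technique), and a clique is dropped when at least a fraction Ф of its nodes already appear in two or more previously accepted cliques. The function names, the tie-breaking rule, and the toy graph are assumptions.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # The pivot with most neighbors in p minimizes the branching.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def filter_cliques(cliques, k=3, phi=0.75):
    """Keep cliques with >= k nodes that are not near-duplicates of larger ones."""
    kept = []
    cover_count = {}  # node -> number of accepted cliques containing it
    candidates = [c for c in cliques if len(c) >= k]
    # Largest first; ties are broken by node labels only to make runs repeatable.
    candidates.sort(key=lambda c: (-len(c), sorted(map(str, c))))
    for c in candidates:
        covered_twice = sum(1 for v in c if cover_count.get(v, 0) >= 2)
        if covered_twice / len(c) >= phi:
            continue  # too much of this clique is already covered twice
        kept.append(c)
        for v in c:
            cover_count[v] = cover_count.get(v, 0) + 1
    return kept


def graph_from_groups(groups):
    """Hypothetical helper: make every listed group of nodes fully connected."""
    adj = {}
    for g in groups:
        for v in g:
            adj.setdefault(v, set())
        for u, v in combinations(g, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


groups = [{1, 2, 3, 4, 5}, {1, 2, 3, 6, 7}, {1, 2, 3, 8}]
adj = graph_from_groups(groups)
cliques = maximal_cliques(adj)
centers = filter_cliques(cliques, k=3, phi=0.75)
print(centers)  # the clique {1, 2, 3, 8} is dropped: 3 of its 4 nodes are covered twice
```

In the toy graph the two 5-cliques are accepted first, after which nodes 1, 2 and 3 are each covered twice, so the 4-clique reaches the 0.75 threshold and is discarded as a near-duplicate.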
(2) Choose the largest unexpanded clique as a community center (ComCenter  The overlap heuristic between two communities is ε [5], similar to Ф, where C' may be a clique or a community we have just obtained. If the overlap heuristic of C' with any already accepted community $C_i$ is no smaller than ε, then C' and $C_i$ are highly overlapping communities, so C' is discarded. Otherwise, C' is accepted. We set ε to 0.6. Note that if a selected clique already meets this condition with some $C_i$ before it is expanded, the expansion can be skipped.
(4) Repeat from (2) until no cliques remain.
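The acceptance test based on ε can be sketched as below. The exact overlap formula is not given in the text, so the sketch assumes the overlap of a candidate C' with an accepted community C_i is |C'∩C_i| / |C'|, by analogy with the description of Ф. This is an assumption, and [5] may define it differently. The function names are hypothetical, and the expansion step itself is not shown.

```python
def overlap(candidate, accepted):
    """Assumed overlap heuristic: shared nodes as a fraction of the candidate."""
    return len(candidate & accepted) / len(candidate)


def is_accepted(candidate, accepted_communities, eps=0.6):
    """Accept the candidate only if its overlap with every accepted community is below eps."""
    return all(overlap(candidate, c) < eps for c in accepted_communities)


def select_communities(candidates, eps=0.6):
    """Process candidates in the given order and keep those that pass the test."""
    accepted = []
    for candidate in candidates:
        if is_accepted(candidate, accepted, eps):
            accepted.append(candidate)
    return accepted


candidates = [
    {1, 2, 3, 4, 5},
    {3, 4, 5, 6, 7},    # shares 3 of 5 nodes with the first: overlap 0.6, discarded
    {4, 5, 6, 7, 8},    # shares 2 of 5 nodes with the first: overlap 0.4, accepted
]
result = select_communities(candidates, eps=0.6)
print(result)
```

Because the same test can be applied to a clique before expansion, a clique that already fails it can be skipped without paying for the expansion.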

Experiment Results
Since the ground truth of communities in many large networks is hard to define, we apply our algorithm to artificial networks and real-world networks. All networks are undirected and unweighted. For artificial networks, the real community structures are known, so we can measure the accuracy of our results by comparing them with the ground-truth communities. NMI (Normalized Mutual Information) is an evaluation index of this kind, and Lancichinetti et al. [1] modified it to handle overlapping communities. We therefore use NMI as the evaluation criterion on artificial networks. For real-world networks, we use the function EQ [7] to measure the results, setting aside its limitations.
We compare against the following algorithms: LFM [1], which also uses a local greedy optimization strategy, with α set to 1.1 in order to make a suitable comparison; CPM [8], which has been described above; COPRA [9], which uses a label propagation technique, with v set to 3; and CONGO [10], which uses a node-splitting method and is applicable when the number of communities is known, with h set to 2. All implementations we used were obtained from the respective authors.
Experiments on Artificial Networks. LFR Overlapping Benchmark. The networks it generates have no outliers; however, we can still verify our algorithm's effectiveness against several classical overlapping community detection algorithms. We set the number of nodes to 1000, the average node degree to 20, the maximum node degree to 50, the minimum and maximum community sizes to 10 and 50, the exponents of the power-law distributions of node degree and community size to 2 and 1, and the mixing parameter (mp) to values from 0.4 to 0.9. A greater value of mp means that the community structure of the network is less obvious. The results are shown in Table 2. On these artificial datasets, WDLCD performs competitively against several known overlapping community detection algorithms.
Experiments on Real World Networks.
Yeast [11]. A protein-protein interaction network in yeast, in which communities are likely to group proteins having the same specific function within the cell. The dataset after processing has 2284 nodes and 6646 edges. The results are shown in Table 3.
NetScience [12]. A coauthorship network of scientists working on network theory and experiment. The dataset after processing has 1461 nodes and 2742 edges. The results are shown in Table 4.
In the two experiments, WDLCD also performs better than the other algorithms, except that LFM achieves a better EQ on Yeast.
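For readers who want to reproduce the evaluation on real networks, the sketch below computes an overlapping modularity score. It assumes that EQ denotes the extended modularity commonly used for overlapping covers, in which the contribution of each node pair inside a community is divided by the number of communities each of the two nodes belongs to. The paper does not state the formula, so this form, the function name, and the toy graph are assumptions.

```python
def extended_modularity(adj, communities):
    """Assumed EQ: (1/2m) * sum over communities and node pairs (v, w) in them of
    (A_vw - k_v * k_w / 2m) / (O_v * O_w), where O_v is the number of
    communities that contain node v."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    membership = {}
    for c in communities:
        for v in c:
            membership[v] = membership.get(v, 0) + 1

    total = 0.0
    for c in communities:
        for v in c:
            for w in c:  # ordered pairs, including v == w, as in standard modularity
                a_vw = 1.0 if w in adj[v] else 0.0
                expected = len(adj[v]) * len(adj[w]) / (2 * m)
                total += (a_vw - expected) / (membership[v] * membership[w])
    return total / (2 * m)


# Hypothetical toy graph: two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
disjoint_cover = [{"a", "b", "c"}, {"d", "e", "f"}]
eq_disjoint = extended_modularity(adj, disjoint_cover)
print(eq_disjoint)  # equals the standard modularity 5/14 when nothing overlaps
```

When no node belongs to more than one community, every membership count is one and the score reduces to the standard modularity, which gives a simple sanity check for an implementation.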

Conclusion
We introduced a new local metric for detecting overlapping communities. It combines the weak community structure definition with the density of a community and can solve the problem of outlier chains in networks. We demonstrate that our algorithm WDLCD performs better than several classical algorithms on artificial and real-world networks.
Although our algorithm obtains better results than several classical algorithms, its effectiveness and efficiency are poorer than those of the excellent local overlapping community detection algorithm GCE [5]. Further work should therefore continue to improve the local metric and try to parallelize the algorithm. In addition, the modularity function used to evaluate the results is based on the idea that the more links there are within communities relative to links between communities, the better the community structure of the network. Owing to the exclusion of outlier chains, we need to propose a more effective function to measure the results. Furthermore, we also need to find better and larger real networks for experiments.