Short bio

Robson L. F. Cordeiro received the BSc degree in Computer Science (CS) from the University of Oeste Paulista, Brazil, in 2002, the MSc degree in CS from the Federal University of Rio Grande do Sul, Brazil, in 2005, and the PhD degree in CS from the University of São Paulo, Brazil, in 2011. His PhD program included a visiting period of one year at the Carnegie Mellon University, USA, from 2009 to 2010. He was also a Postdoctoral Researcher at the University of São Paulo, Brazil, from 2011 to 2013. His PhD Dissertation won the ’best CS Dissertation Award’ in 2012 from the Brazilian Computer Society - SBC, and generated one book published by Springer that was chosen as one of the 'Computing Reviews' Notable Computing Books and Articles of 2013' by ACM. Robson is currently an Assistant Professor at the University of São Paulo, Brazil. His research interests include mining and managing Big Data of moderate-to-high dimensionality, complex data and large graphs. He is a member of the IEEE, ACM, and SBC.

The Method Halite for Correlation Clustering

The algorithm Halite is a fast and scalable density-based clustering algorithm for moderate-to-high-dimensionality data able to analyze large collections of complex data elements. It creates a multi-dimensional grid all over the data space and counts the number of points lying at each hyper-cubic cell provided by the grid. A hyper-quad-tree-like structure, called the Counting-tree, is used to store the counts. The tree is thereafter submitted to a filtering process able to identify regions that are, in a statistical sense, denser than its neighboring regions regarding at least one dimension, which leads to the final clustering result. The algorithm is fast and it has linear or quasi-linear time and space complexity regarding both the data size and the dimensionality.

Remark: A first implementation of Halite was initially named as the method MrCC (after Multi-resolution Correlation Clustering) in an earlier Conference Publication of this work. Latter, it was renamed to Halite for clarity, since several improvements on the initial implementation were included into a Journal Publication.

Download
Conference Publication
Journal Publication

The Method BoW for Clustering Terabyte-scale Datasets

The method BoW focuses on the problem of finding clusters in Terabytes of moderate-to-high dimensionality data, such as features extracted from billions of complex data elements. In these cases, a serial processing strategy is usually impractical. Just to read a single Terabyte of data (at 5GB/min on a single modern eSATA disk) one takes more than 3 hours. BoW explores parallelism and can treat as plug-in almost any of the serial clustering methods. The major research challenges addressed are (a) how to minimize the I/O cost, taking care of the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may become the bottleneck. Our method automatically spots the bottleneck and chooses a good strategy, one of them uses a novel sampling-and-ignore idea to reduce the network traffic. Specifically, BoW combines (a) potentially any serial algorithm used as a plug-in and (b) makes the plug-in run efficiently in parallel, by adaptively balancing the cost for disk accesses and network accesses, which allows BoW to achieve a very good tradeoff between these two possible bottlenecks.

Download
Conference Publication

The Method QuMinS for Labeling and Summarization

The algorithm QuMinS focuses on two distinct data mining tasks – the tasks of labeling and summarizing large sets of moderate-to-high dimensionality data, such as features extracted from Gigabytes of complex data elements. Specifically, QuMinS is a fast and scalable solution to two problems (a) low-labor labeling – given a large collection of data objects, very few of which are labeled with keywords, find the most suitable labels for the remaining ones, and (b) mining and attention routing – in the same setting, find clusters, the top-NO outlier objects, and the top-NR representative objects. The algorithm is fast and it scales linearly with the data size, besides working even with tiny initial label sets.

Remark: A first implementation of QuMinS was initially named as the method QMAS (after Querying, Mining And Summarizing Multi-modal Databases) in an earlier Conference Publication of this work. Latter, it was renamed to QuMinS for clarity, since several improvements on the initial implementation were included into a Journal Publication.

Download
Conference Publication
Journal Publication