Research Paper

Parsimonious Side Propagation

A fast parsimonious linear-programming-based algorithm for training neural networks is proposed that suppresses redundant features while using a minimal number of hidden units. This is achieved by propagating sideways to newly added hidden units the task of separating successive groups of unclassified points. Computational results show an improvement of 26.53% and 19.76% in tenfold cross-validation […]

Scaling EM (Expectation-Maximization) Clustering to Large Databases

Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require the access to every record in the data table Click here to view full-screen version […]

Constrained K-Means Clustering

We consider practical methods for adding constraints to the K-Means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. Click here to view full-screen version >>

Clustering via Concave Minimization

The problem of assigning m points in the n-dimensional real space Rn to k clusters is formulated as that of determining k centers in Rn such that the sum of distances of each point to the nearest center is minimized. If a polyhedral distance is used, the problem can be formulated as that of minimizing […]

Feature Selection via Mathematical Programming

The problem of discriminating between two fi nite point sets in n-dimensional feature space by a separating plane that utilizes as few of the features as possible, is formulated as a mathematical program with a parametric objective function and linear constraints. The step function that appears in the objective function can be approximated by a sigmoid […]

Parsimonious Least Norm Approximation

A theoretically justifiable fast finite successive linear approximation algorithm is proposed for obtaining a parsimonious solution to a corrupted linear system Ax = b + p, where the corruption p is due to noise or error in measurement. Click here to view full-screen version >>

Feature Selection via Concave Minimization and Support Vector Machines

Computational comparison is made between two feature selection approaches for fi nding a separating plane that discriminates between two point sets in an n-dimensional feature space that utilizes as few of the n features (dimensions) as possible. In the concave minimization approach [19, 5] a separating plane is generated by minimizing a weighted sum of distances […]

Refining Initial Points for K-Means Clustering

Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique […]

Scaling Clustering Algorithms to Large Databases

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means […]

Initialization of Iterative Refinement Clustering Algorithms

Iterative refinement clustering algorithms (e.g. K-Means, EM) converge to one of numerous local minima. It is known that they are especially sensitive to initial conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The […]

Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions

Efficiently answering decision support queries is an important problem. Most of the work in this direction has been in the context of the data cube. Queries are efficiently answered by pre-computing large parts of the cube. Besides having large space requirements, such pre-computation requires that the hierarchy along each dimension be fixed (hence dimensions are […]

Massive Data Discrimination via Linear Support Vector Machines

A linear support vector machine formulation is used to generate a fast, fi nitely terminating linear-programming algorithm for discriminating between two massive sets in n-dimensional space, where the number of points can be orders of magnitude sufficiently small linear programs that separate chunks of the data at a time. Click here to view full-screen version >>

k-Plane Clustering

A fi nite new algorithm is proposed for clustering m given points in n-dimensional real space into k clusters by generating k planes that constitute a local solution to the nonconvex problem of minimizing the sum of squares of the 2-norm distances between each point and a nearest plane. Click here to view full-screen version >>

Efficient Discovery of Error-Tolerant Frequent Itemsets in High Dimensions

We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error tolerant frequent clusters of items in transactional data (customer purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find […]

Model Based Population Tracking and Automatic Detection of Distribution Changes

Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. Click here to view full-screen version >>

Optimization Methods In Massive Datasets

We describe the role of generalized support vector machines in separating massive and complex data using arbitrary nonlinear kernels. Click here to view full-screen version >>

Data Mining as an Automated Service

An automated data mining service offers an out- sourced, cost effective analysis option for clients desiring to leverage their data resources for decision support and operational improvement. In the context of the service model, typically the client provides the service with data and other information likely to aid in the analysis process (e.g. domain knowledge, […]

Philosophies and Advances in Scaling Mining Algorithms to Large Databases

Data mining has become increasingly important as a key to analyzing, digesting and understanding the flood of digital data. Achieving this goal requires scaling mining algorithms to large databases. Click here to view full-screen version >>