The computing scientist's main challenge is not to get confused by the complexities of (their) own making.
-- E.W. Dijkstra
Much data is redundant, noisy, or irrelevant. For example:
So one way to use less data is to share only a small number of prototypes; i.e., use fewer rows. Not only that, but we can also use fewer columns:
That is, if we were so foolish as to try to build high-dimensional models, we would fail, since the region where we can find related examples would become vanishingly small. Note that this is often called the curse of dimensionality.
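The curse of dimensionality is easy to demonstrate. The sketch below (an illustration, not from the original text) samples random points in the unit cube and measures what fraction fall within a fixed radius of the cube's center; as the number of dimensions grows, that fraction collapses toward zero:

```python
import math
import random

def fraction_within(dim, radius=0.5, n=2000, seed=1):
    """Fraction of n random points in the dim-dimensional unit cube
    that lie within `radius` of the cube's center."""
    rng = random.Random(seed)
    center = [0.5] * dim
    hits = 0
    for _ in range(n):
        p = [rng.random() for _ in range(dim)]
        if math.dist(p, center) <= radius:
            hits += 1
    return hits / n

for dim in (2, 5, 10, 20):
    print(dim, fraction_within(dim))
```

In two dimensions, most points fall inside the radius; by twenty dimensions, essentially none do, so "nearby related examples" become impossible to find.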
Note that this curse can also be a blessing:
There is much empirical evidence that just because a data set has n columns, we need not use them all. Numerous researchers have examined what happens when a data miner deliberately ignores some of the columns in the training data. For example, the experiments of Ron Kohavi and George John show that, on numerous real-world datasets, over 80% of columns can be ignored. Further, ignoring those columns doesn't degrade the learner's classification accuracy (in fact, it sometimes even results in small improvements).
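To make column pruning concrete, here is a minimal sketch of a column-scoring filter. Note that this is a crude stand-in for illustration only: Kohavi and John actually used wrapper-based selection (re-running the learner on candidate column subsets), whereas this filter simply ranks each numeric column by how far apart its per-class means sit relative to the column's overall spread:

```python
import statistics

def score_columns(rows):
    """Rank numeric columns by class separation: the gap between
    per-class means, divided by the column's overall spread.
    Columns scoring near zero carry little class signal and are
    candidates for being ignored."""
    *features, _ = zip(*rows)          # columns; last column is the class
    labels = [row[-1] for row in rows]
    classes = sorted(set(labels))
    scores = []
    for col in features:
        sd = statistics.pstdev(col) or 1e-12
        means = [statistics.mean(v for v, l in zip(col, labels) if l == c)
                 for c in classes]
        scores.append((max(means) - min(means)) / sd)
    return scores

# Tiny synthetic table: column 0 predicts the class, column 1 is noise.
data = [(1.0, 5.0, "yes"), (1.1, 2.0, "yes"), (0.9, 7.0, "yes"),
        (3.0, 6.0, "no"),  (3.2, 3.0, "no"),  (2.9, 5.0, "no")]
s = score_columns(data)
print(s)   # first score is large, second near zero
```

On this toy table, the noisy column scores near zero and could be dropped without hurting a downstream learner, which is the effect the Kohavi and John experiments observed at scale.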
Further, if we combine both prototype and column selection, the net result can be a dramatic reduction in the complexity of the data:
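The arithmetic of that combined reduction is striking. The sketch below (an assumption-laden illustration: it samples rows at random and simply truncates columns, standing in for real prototype and column selectors) shrinks a 1000-row, 20-column table to 20 rows and 5 columns:

```python
import random

def shrink(rows, keep_rows=20, keep_cols=5, seed=1):
    """Keep a random sample of prototype rows and a prefix of the
    columns (assumed pre-ranked by some relevance score).  A sketch
    only: real prototype selectors cluster the data first."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(keep_rows, len(rows)))
    return [row[:keep_cols] for row in sample]

big = [list(range(20)) for _ in range(1000)]   # 1000 x 20 table
small = shrink(big)
before = len(big) * len(big[0])                # 20,000 cells
after = len(small) * len(small[0])             # 100 cells
print(f"reduction: {1 - after / before:.1%}")
```

Here 20,000 cells shrink to 100, a 99.5% reduction; the open question, addressed next, is when such a summary is safe.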
The only way larger data sets can be summarized to smaller ones is if there are superfluous details in the larger set. Hence, before we can advocate such summarizations, we must first offer a measure of data set simplicity and only summarize the simpler data. The next figure offers intrinsic dimensionality as such a measure and applies it to 10 data sets with 21 columns of data. As shown in that figure, the intrinsic dimensionality of our data sets can be very small indeed. It is hardly surprising that such an intrinsically low-dimensional data set can be summarized in half a dozen columns and a few dozen rows.
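One common way to estimate intrinsic dimensionality is the correlation-dimension method: count the point pairs closer than two radii, then take the slope of log(pair count) against log(radius). The sketch below is one such estimator under assumed radii (definitions vary across papers); on data stored in three columns but varying along only one of them, it reports a dimensionality near one, not three:

```python
import math
import random

def intrinsic_dim(points, r1=0.05, r2=0.2):
    """Correlation-dimension estimate: slope of log(pairs within r)
    vs log(r), computed from two radii r1 < r2."""
    def pairs_within(r):
        n = 0
        for i, p in enumerate(points):
            for q in points[i + 1:]:
                if math.dist(p, q) <= r:
                    n += 1
        return max(n, 1)   # guard against log(0)
    c1, c2 = pairs_within(r1), pairs_within(r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

# 200 points described by 3 columns, but varying along only 1 of them.
rng = random.Random(1)
line = [(rng.random(), 0.5, 0.5) for _ in range(200)]
d = intrinsic_dim(line)
print(d)   # near 1, far below the 3 raw columns
```

The gap between raw column count and intrinsic dimensionality is exactly the slack that row and column pruning exploit.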