“Machine Learning” is a term trending not only in the technology industry but also in fields such as government, healthcare, marketing and education.
Machine Learning, along with other terms such as Predictive Analytics, Data Analytics, Classification, Pattern Recognition and Data Science, is often used casually, if not wrongly, without much explanation. If that is not confusing enough, consider also traditional terms such as Statistical Analysis, Data Warehousing, Data Mining, Knowledge Discovery, Artificial Intelligence and Business Intelligence. The line between these terms is often blurry, even for people working in the field.
At PredictionIO, we believe that Machine Learning is the foundation of a better and smarter future. As the creator of the first open source Machine Learning server, we feel the need to share our understanding of “Machine Learning” and what differentiates it from other terms.
Please note that some related terms such as Data Visualization, Natural Language Processing and Signal Processing in Electrical Engineering are more self-explanatory. Therefore, we are not going to elaborate on them here.
What is Machine Learning?
Machine learning, in short, refers to computers learning to predict from data.
Machine Learning has empowered many smart applications. For example, Apple’s Siri learns from data to predict the meaning of human speech and the desired answers or actions to be performed. Facebook’s photo album learns from data to predict (or recognize) faces to be tagged in photos. LinkedIn learns from data to predict who you want to connect with. Google’s driverless car learns from data to predict the appropriate driving actions.
So the goal of Machine Learning is for a computer to predict something. While predicting a future event is one of the obvious scenarios, it also encompasses the prediction of events or things that are unknown to the computer, i.e. something you have not inputted or programmed into it. Arthur Samuel described it back in 1959: “(Machine Learning is a) field of study that gives computers the ability to learn without being explicitly programmed”.
Learning implies improvement through gaining experience or knowledge. Tom Mitchell emphasizes the learning requirement in his book Machine Learning: “A (machine learning) computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”. Because of this, performance measurement is an essential component in Machine Learning. Without a meaningful evaluation metric, whether it is quantitative or qualitative, the goal of Machine Learning can hardly be achieved.
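A performance measure P can be as simple as classification accuracy, i.e. the fraction of predictions that turn out to be correct. As a minimal illustration (our own sketch, not code from any particular library):

```python
def accuracy(predictions, actuals):
    """A simple quantitative performance measure P: the fraction of
    predictions that match the actual outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

# Two of three weather predictions were correct.
print(round(accuracy(["sunny", "rain", "rain"],
                     ["sunny", "rain", "sunny"]), 2))  # 0.67
```

If accuracy improves as the program gains experience (i.e. sees more data), the program is learning in Mitchell’s sense.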
Data comprises the known properties, experience or knowledge that the computer learns from. Data used for training a prediction model in Machine Learning is called training data. Data used for evaluating performance during the training stage is called validation data. Data used for evaluating the final performance is called test data.
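A common way to obtain these three datasets is to shuffle the available data and divide it by fixed proportions. A minimal sketch in Python (the 60/20/20 split is an illustrative assumption, not a fixed rule):

```python
import random

def split_data(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Randomly split a dataset into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's data stays intact
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever remains
    return train, validation, test

train, validation, test = split_data(list(range(100)))
print(len(train), len(validation), len(test))  # 60 20 20
```

Keeping the test set untouched until the very end is what makes the final performance estimate trustworthy.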
In summary, Machine Learning is about attempting to teach computers to predict future, or otherwise unknown, events by applying computer science and statistics techniques to analyze existing data. It can be seen as a process that transforms existing data into insights about the unknown.
Machine Learning is often associated with other frequently used terms such as predictive analytics, big data, data mining, business intelligence and data science. Let’s take a look at them briefly.
An equivalent, but less formal, term for Machine Learning is Predictive Analytics. It is widely used in business. As Eric Siegel states in his new book Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die: “[Predictive Analytics is] technology that learns from experience [i.e. data] to predict the future behavior of individuals in order to drive better decisions”. That is essentially what Machine Learning is about.
Big Data fuels the development of Machine Learning.
O’Reilly summarizes the characteristics of Big Data with Volume, Velocity and Variety. Gartner similarly states that “Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Organizations collect a large variety of data in huge volume with high velocity from different sources such as smartphones and social media. The availability of big data poses new challenges and opportunities at the same time. Data is the essential ingredient to empower Machine Learning, to enable computers to learn and to predict better. On the other hand, it stretches the limits of scalability, computational power, processing speed and flexibility of existing machine learning technology.
Supervised Learning is one of the popular task types in Machine Learning.
The task for the computer is to take an input and then predict an output based on what it has learnt from the training data. Training data consists of pairs of inputs and correct outputs, known as labeled data. When the output is a discrete variable, e.g. predicting whether tomorrow will be 1) sunny; 2) cloudy; or 3) raining, it is called a Classification problem. When the output is a continuous variable, e.g. predicting tomorrow’s temperature, it is called a Regression problem. Classification is also called Discriminant Analysis in statistics and Pattern Recognition in engineering.
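The classification/regression distinction can be made concrete with two toy models (our own sketch; the humidity-to-weather mapping and the temperature figures are invented for illustration):

```python
def nearest_neighbor_classify(train_pairs, x):
    """Classification: predict a discrete label by copying the label
    of the closest labeled training example."""
    nearest = min(train_pairs, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def linear_regression_fit(xs, ys):
    """Regression: fit y = a*x + b by ordinary least squares,
    so the model can output a continuous value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Classification: humidity (%) -> discrete weather label.
weather = [(10, "sunny"), (50, "cloudy"), (90, "raining")]
print(nearest_neighbor_classify(weather, 85))  # raining

# Regression: day number -> continuous temperature.
a, b = linear_regression_fit([1, 2, 3, 4], [20.0, 21.0, 22.0, 23.0])
print(round(a * 5 + b, 1))  # 24.0
```

Both models learn from labeled pairs; only the type of output (discrete versus continuous) differs.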
Unsupervised Learning is another popular task type in Machine Learning. The task for the computer is to discover structures or patterns in the training data, and then to predict which group a new input belongs to. The training data consists of inputs only, without any example outputs; this is called unlabeled data. The search for structure in Unsupervised Learning is called Density Estimation in statistics.
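A classic example of discovering structure in unlabeled data is k-means clustering. A minimal one-dimensional sketch (our own illustration with made-up data points):

```python
def kmeans_1d(points, k, iters=20):
    """Unsupervised learning: group unlabeled 1-D points into k clusters.
    No correct answers are given; structure emerges from the data itself."""
    # Initialize centroids spread evenly across the data range.
    lo, hi = min(points), max(points)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centroids, clusters = kmeans_1d(points, k=2)
print(sorted(round(c, 1) for c in centroids))  # [1.0, 10.0]
```

The algorithm never sees a label; it finds the two natural groups on its own, which is exactly the unsupervised setting described above.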
Artificial Intelligence is about constructing a computer system, called intelligent agent, to behave and perform tasks like a human. Despite the lack of emphasis on learning from data and prediction, Artificial Intelligence is closely related to Machine Learning, because, as Russell and Norvig point out in Artificial Intelligence: A Modern Approach, an intelligent system is able to adapt to changes in its environment. Their applications are often very similar and they share some techniques such as Neural Networks. As the definition of Artificial Intelligence is broader and more general than that of Machine Learning, one may claim that Machine Learning is a specific class of problems in Artificial Intelligence.
As mentioned, Machine Learning is about teaching computers to predict the unknown by learning from known data, and this differs from the goal of Statistics. Statistics, or Statistical Analysis, is about teaching humans what has happened or what is happening by looking at data, in order to make better decisions.
Machine Learning is a learning process to build up reusable knowledge for a computer to perform tasks in the future. Statistical Analysis is an exploratory process to discover previously unknown trends, patterns or other knowledge about the data provided, and the end result should be interpretable by humans.
Statisticians also have their own terminology. For example, as Ethem Alpaydın mentions in Introduction to Machine Learning, “In statistics, going from particular observations to general descriptions is called inference and learning is called estimation.”
Despite the difference in terms of their goals, Statistical Analysis and Machine Learning overlap a lot. They both analyze data and extract insights from it. Also, statistical techniques are frequently used in Machine Learning.
Roughly speaking, Data Mining, a term used in Computer Science, shares the same goals as Statistical Analysis. The historical difference is described in Jerome Friedman’s “Data Mining and Statistics: What’s the Connection?” (1998): “Despite the obvious connections between data mining and statistical data analysis, most of the methodologies used in Data Mining have so far originated in fields other than Statistics”. However, the two fields have since evolved so much that they now share many common methodologies. Nowadays, the two terms are largely interchangeable.
When we talk about Data Mining, it is worth mentioning a closely related term: Knowledge Discovery in Databases (KDD). The distinction between Data Mining and Knowledge Discovery in Databases was discussed in “From Data Mining to Knowledge Discovery in Databases”: “KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data.”
The additional steps of KDD mentioned in this paper are data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining. They are not part of Data Mining.
Business Intelligence / Business Analytics
As the name suggests, business usage is the focus here. Business Intelligence, or equivalently Business Analytics, is a very broad term referring to commercial organizations using data to learn about the business, market or customers and to make factually supported decisions. It may also involve making predictions about the future or the unknown for the benefit of the business. It may be described as KDD for business.
Lastly, Data Science is an umbrella term for everything mentioned above that makes use of data, with an emphasis on the use of sophisticated algorithms or scientific methods.
The term has a strong academic background. Peter Naur, a professor of computer science at the University of Copenhagen, first coined it in the 1960s. In the mid-1990s, the International Federation of Classification Societies described it as “research in problems of classification, data analysis, and systems for ordering knowledge”. Data Science was proposed as a new academic discipline in 2001 by William Cleveland, who argued for extending statistics with “advances in computing with data”.
The term Data Science has become widely used in business in recent years. Before its widespread use, the business community tended to use two other umbrella terms with similar meanings: Data Analytics and Data Warehousing. Both are about analyzing data to extract useful information or knowledge. Data Warehousing usually refers to tasks that produce one-off reports in an offline mode, while Data Analytics can take place in a real-time online environment. Unlike Data Science, however, neither term implies the use of sophisticated algorithms.
The profession of a Data Scientist is also becoming more popular. An article in the Harvard Business Review describes it as “the sexiest job of the 21st century”.
Data Science usually involves techniques based on research in fields such as computer science, statistics, mathematics, physics, finance, economics and psychology.
The Uniqueness of Machine Learning
As you may have noticed already, Machine Learning is uniquely about:
- Helping a computer to learn, instead of helping a human to interpret
- Focusing specifically on predicting the future or the unknown
- Improving performance as more data is analyzed
In this piece, we have boldly attempted to define scopes and boundaries for a number of multidisciplinary conceptual terms. The purpose is to explain their relationships with Machine Learning. While the descriptions we provide here may not cover every perspective completely, we hope they help to clarify some of the confusion in the industry. Your suggestions are very welcome.