Science Made Simple: What Is Machine Learning?

Machine learning uses computers to detect patterns in large datasets and make predictions based on those patterns.

What Is Machine Learning?

Machine learning is the process of using computers to detect patterns in massive datasets and then make predictions based on what the computer learns from those patterns. This makes machine learning a specific and narrow type of artificial intelligence. Artificial intelligence more broadly involves machines that can perform tasks we associate with the minds of human beings and intelligent animals, such as perceiving, learning, and problem-solving.

All machine learning is based on algorithms. In general, algorithms are sets of specific instructions that a computer uses to solve problems. In machine learning, algorithms are rules for how to analyze data using statistics. Machine learning systems use these rules to identify relationships between data inputs and desired outputs, usually predictions. To get started, scientists give machine learning systems a set of training data. The systems apply their algorithms to this data to learn how to analyze similar inputs they receive in the future.
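The idea of learning a relationship from training data and then applying it to a new input can be sketched in a few lines of Python. This is only an illustration: the numbers are made up, the function names are hypothetical, and the "algorithm" here is ordinary least squares (fitting a straight line), one of the simplest statistical rules a system could use.

```python
# Hypothetical training data: each input is paired with its desired output.
train_inputs = [1.0, 2.0, 3.0, 4.0]
train_outputs = [3.0, 5.0, 7.0, 9.0]  # made-up values that follow output = 2 * input + 1


def fit_line(xs, ys):
    """Learn a slope and intercept from training data using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned relationship to an input the system has not seen before."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train_inputs, train_outputs)  # training step
prediction = predict(model, 10.0)              # prediction step
print(prediction)
```

The training step never sees the input 10.0. The prediction comes entirely from the pattern the system found in the training data.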

Machine learning can quickly analyze complex phenomena like this simulation of ice crystals. Machine learning combined shape classification, image processing, and statistical analysis to identify and characterize the ice grains. Credit: Image courtesy of Argonne National Laboratory

One area where machine learning shows huge promise is detecting cancer in computed tomography (CT) imaging. First, researchers assemble as many CT images as possible to use as training data. Some of these images show tissue with cancerous cells, and some show healthy tissues. Researchers also assemble information on what to look for in an image to identify cancer. For example, this might include what the boundaries of cancerous tumors look like. Next, they create rules on the relationship between data in the images and what doctors know about identifying cancer. Then they give these rules and the training data to the machine learning system. The system uses the rules and the training data to teach itself how to recognize cancerous tissue. Finally, the system gets a new patient’s CT images. Using what it has learned, the system decides which images show signs of cancer, faster than any human could. Doctors could use the system’s predictions to aid in the decision about whether a patient has cancer and how to treat it.

The way training data is set up divides machine learning systems into two broad types: supervised and unsupervised. If the training data is labeled, the system is supervised. Labeled data tells the system what the data is. For example, CT images could be labeled to indicate cancerous lesions or tumors next to tissues that are healthy. Basically, this means the machine learning system learns by example. Labeling can be very time-consuming because training datasets require large amounts of data.
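Learning by example can be sketched with a very small supervised classifier. Everything in this sketch is an assumption made for illustration: the two numeric features, the labels "healthy" and "suspicious", and the data values are invented and have no medical meaning. The technique shown is a nearest-centroid classifier, chosen because it is short, not because it is what a real imaging system would use.

```python
import math

# Hypothetical labeled training data: (feature_1, feature_2) -> label.
# The values are invented for illustration and are not real measurements.
labeled_examples = [
    ((1.0, 1.2), "healthy"),
    ((1.4, 0.8), "healthy"),
    ((0.8, 1.0), "healthy"),
    ((4.0, 4.2), "suspicious"),
    ((4.4, 3.8), "suspicious"),
    ((3.6, 4.0), "suspicious"),
]


def train(examples):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(value / counts[label] for value in total)
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = train(labeled_examples)
print(classify(centroids, (3.9, 3.5)))  # a new, unlabeled example
```

The labels do all the teaching here. Without them, the `train` function would have no way to know which examples belong together.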

If the training data is not labeled, the machine learning system is unsupervised. In the cancer scan example, an unsupervised machine learning system would be given a huge number of CT scans and information on tumor types, then left to teach itself what to look for to recognize cancer. This frees human beings from needing to label the data used in the training process. The disadvantage of unsupervised learning is that the results may not be as accurate because of the lack of explicit labels.
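For contrast, here is a minimal sketch of an unsupervised approach: grouping unlabeled numbers into clusters. The data values are made up, the function name is hypothetical, and the technique is k-means clustering (Lloyd's algorithm) on one-dimensional data with a fixed starting point so the result is repeatable. It is a common textbook method, not necessarily what any particular system uses.

```python
def k_means_1d(values, k=2, iterations=10):
    """Group unlabeled numbers into k clusters using Lloyd's algorithm."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    step = max(k - 1, 1)
    centers = [ordered[i * (len(ordered) - 1) // step] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        # A center with an empty cluster stays where it is.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled measurements: no one has said what each value means.
unlabeled = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(unlabeled, k=2)
print(centers)
print(clusters)
```

The algorithm finds two groups on its own, but it cannot say what either group represents. A person still has to interpret the clusters, which is one reason results from unlabeled data can be harder to trust.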

Some machine learning systems can improve their abilities based on feedback on their predictions. These are called reinforcement learning systems. For example, the system could be told the results of doctors’ other tests of whether patients have cancer or not. The system could then tweak its algorithms to produce more accurate predictions in the future.
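The feedback loop described above can be sketched as follows. This is a deliberately simplified stand-in, not a full reinforcement learning algorithm: it uses a perceptron-style update, in which the system adjusts two numbers only when it is told a prediction was wrong. The scores, outcomes, and function names are all hypothetical.

```python
def predict(weight, bias, score):
    """Predict 1 (flag for follow-up) if the weighted score is above zero, else 0."""
    return 1 if weight * score + bias > 0 else 0


def update_from_feedback(weight, bias, score, actual, learning_rate=0.1):
    """Perceptron-style update: parameters change only when the prediction was wrong."""
    error = actual - predict(weight, bias, score)  # -1, 0, or +1
    new_weight = weight + learning_rate * error * score
    new_bias = bias + learning_rate * error
    return new_weight, new_bias


# Hypothetical feedback: (score the system saw, outcome confirmed by other tests).
feedback = [(0.2, 0), (0.9, 1), (0.4, 0), (0.8, 1), (0.3, 0), (0.7, 1)]

weight, bias = 0.0, 0.0  # the system starts out knowing nothing
for _ in range(200):     # repeated passes over the same feedback
    for score, actual in feedback:
        weight, bias = update_from_feedback(weight, bias, score, actual)

print(predict(weight, bias, 0.95), predict(weight, bias, 0.1))
```

Each wrong prediction nudges the parameters toward the confirmed outcome, so the system's later predictions agree with the feedback it has received.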

Fast Facts

The newest of DOE’s supercomputers—Summit at Oak Ridge National Laboratory—has an architecture especially well-suited for artificial intelligence applications.
Machine learning allows scientists to analyze quantities of data that were previously inaccessible.
DOE-funded researchers have used machine learning to develop new cancer screening, better understand the properties of water, and autonomously steer experiments.
Physics-informed machine learning uses deep neural networks that can be trained to incorporate specific laws of physics to solve supervised learning tasks and scientific problems.
Machine learning algorithms are not a silver bullet. The development of machine learning systems is susceptible to human error and biases and requires the same careful design as software engineering.

DOE Office of Science: Contributions to Machine Learning

The Department of Energy Office of Science supports research on machine learning through its Advanced Scientific Computing Research (ASCR) program. ASCR has a portfolio of data management, data analysis, computer technology, and related research that all contribute to machine learning and artificial intelligence. As part of this portfolio, DOE owns some of the world’s most capable supercomputers.

The DOE Office of Science as a whole is committed to the use of machine learning to support scientific research. Science depends on big data, and Office of Science user facilities such as particle accelerators and X-ray light sources generate mountains of it. Using machine learning, researchers are identifying patterns in data from these facilities that are difficult or impossible for humans to detect, at speeds hundreds to thousands of times faster than traditional data analysis techniques.