Lecture 12: Machine Learning and Data Mining in Healthcare

Learning is a process in which an agent acquires or improves knowledge, skills, or performance. Machine learning is an interdisciplinary field that concerns the learning abilities of computers. It is closely related to artificial intelligence as well as to other disciplines such as statistics, logic, psychology, and neuroscience. In the most common situations, computers learn by analyzing data. That data can come from databases, be acquired while learning (e.g., from real-time sensors), or be given by a teacher.

Data mining is a field concerned with analyzing large amounts of data and discovering new patterns, regularities, and models. It uses methods from many disciplines, including databases, machine learning, statistics, and high-performance computing.

This lecture only touches on many important topics related to machine learning and data mining. As with many topics in this class, it is intended to introduce students to the field and give some general ideas. This background is important when discussing new technologies and potential solutions that can affect the quality and cost of care in all types of healthcare organizations. Students interested in exploring more are encouraged to take the HAP 780 (Data Mining in Health Care) course at GMU, which specifically addresses issues related to analyzing and learning from healthcare data.

The vast majority of applications of machine learning in healthcare create models that can reason about new cases and provide answers to specific questions. For example, such a model can answer the question of whether a patient will be readmitted to a hospital within 30 days of discharge. Machine learning methods use algorithms that analyze data on past patients and find regularities that allow making predictions for future patients. This type of learning is called supervised learning (a.k.a. learning with a teacher or learning from labeled data), because a "teacher" provides examples of past patients and their readmissions and thus supervises the learning process. It is up to the algorithm to crunch the numbers and go through thousands of values to learn.
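To make the idea of learning from labeled examples concrete, here is a minimal sketch of a supervised learner for the readmission question. It uses a k-nearest-neighbor vote, which is only one of many possible methods and is not necessarily the one discussed in the lecture. The patient attributes, the six training records, and the function name predict_readmission are all made up for illustration; they are not real patient data.

```python
import math
from collections import Counter

# Hypothetical, made-up training examples (NOT real patients).
# Features: (age, admissions in the prior year, length of stay in days).
# Label supplied by the "teacher": readmitted within 30 days?
TRAINING = [
    ((82, 3, 9), True),
    ((75, 2, 7), True),
    ((68, 4, 12), True),
    ((45, 0, 2), False),
    ((53, 1, 3), False),
    ((37, 0, 1), False),
]


def distance(a, b):
    """Euclidean distance between two feature tuples (features are not rescaled)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_readmission(patient, training=TRAINING, k=3):
    """Predict the label of a new patient by majority vote of the k most similar past patients."""
    nearest = sorted(training, key=lambda example: distance(example[0], patient))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict_readmission((79, 3, 8)))  # an older patient with several prior admissions
    print(predict_readmission((40, 0, 2)))  # a younger patient with no prior admissions
```

In practice a model like this would be trained on thousands of records and evaluated on patients it has not seen before; the sketch only shows the roles of the labeled examples and the prediction step.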

This is in contrast to unsupervised learning (a.k.a. learning without a teacher or learning from unlabeled data; clustering is its most common form), in which it is up to the algorithm to decide how to group together similar objects (e.g., patients with similar clinical characteristics). Finally, reinforcement learning allows an agent (usually a computer program or a robot) to explore an environment and find out autonomously how to perform a given task better.
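Grouping similar patients without labels can be sketched with a small k-means clustering routine. This is an illustrative assumption about how clustering might be done, not the specific method from the lecture. The two attributes (age and systolic blood pressure), the six records, and the naive choice of the first k points as starting centers are all hypothetical.

```python
# Hypothetical, made-up records (NOT real patients): (age, systolic blood pressure).
PATIENTS = [(34, 118), (71, 162), (29, 115), (68, 158), (40, 122), (75, 170)]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, k=2, iterations=10):
    """Assign each point to one of k clusters; no labels are used."""
    centroids = list(points[:k])  # naive, deterministic initialization (an assumption)
    assignment = []
    for _ in range(iterations):
        # Step 1: attach every point to its closest cluster center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each center to the mean of the points attached to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids


if __name__ == "__main__":
    groups, centers = k_means(PATIENTS)
    print(groups)
    print(centers)
```

Note that the algorithm only returns group numbers. Deciding whether the groups are clinically meaningful remains a task for the analyst.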

These and several other issues are briefly described in the first part of the lecture.

If the above video does not work, download this file: Lecture Part 1.mp4


Weka is open-source machine learning and data mining software from the University of Waikato, New Zealand. It is a popular free tool for analyzing data with many available methods. It includes data preprocessing, supervised and unsupervised learning, some visualization, experimentation, and application methods.

However, Weka is a general tool that can be applied to many domains and is not dedicated to healthcare. This means that it is up to the user to encode data in ways that are both clinically relevant and understandable by algorithms.
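As one illustration of what encoding data for Weka can look like, the sketch below writes a few records in ARFF, the plain-text format Weka reads, with a header that declares each attribute followed by comma-separated data rows. The helper name to_arff, the relation name, the attribute names, and the records are hypothetical and chosen only for this example. Which attributes are clinically relevant is a decision the analyst has to make.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind), where kind is the string "numeric"
                or a list of allowed nominal values.
    rows:       list of tuples in the same order as attributes;
                None is written as "?", the ARFF marker for a missing value.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            values = ",".join(kind)
            lines.append("@attribute " + name + " {" + values + "}")
        else:
            lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and made-up records (NOT real patients).
ATTRIBUTES = [
    ("age", "numeric"),
    ("sex", ["F", "M"]),
    ("readmitted", ["yes", "no"]),
]
ROWS = [(82, "F", "yes"), (45, "M", "no"), (None, "F", "no")]

if __name__ == "__main__":
    print(to_arff("readmissions", ATTRIBUTES, ROWS))
```

The resulting text can be saved with an .arff extension and opened in Weka. Real clinical data usually needs more preparation first, such as turning diagnosis codes into meaningful categories.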

If the above video does not work, download this file: Lecture Part 2.mp4