From Ancient Greeks to Modern Geeks: Basic Machine Learning Algorithms in C#

Part One: Naive Bayes


This article was published as part of the 2022 C# Advent Event. Make sure to check out the other articles that were part of the event too!

Introduction

Anyone familiar with my work so far will know that my usual topic is functional programming - specifically how to accomplish it in C#. There are actually an awful lot of areas of interest in coding that I'm into, and among them is Machine Learning. I've done a few talks touching on it in the past, like the one I've done on randomly generating a new book by Jane Austen using Markov chains (have a hunt around on YouTube; there are a few recorded versions of it out there). This time though, I wanted to take a much deeper dive into the topic. When it's over, I might have a think about practical applications, and see whether I can put a project together.

I'm also planning to do all of this in C#. I'm aware that C# isn't the go-to language for most data scientists (take a bow, Python) but I wanted to do it this way for a few reasons.

The first is that C# is the language I know best. My day job is developing bespoke business applications in C#, and I've been doing it for coming up on two decades now, so I'm pretty well versed in .NET.

The second is as a learning tool, not just for myself, but for any other .NET developers who want to get into Machine Learning but are put off by the unfamiliar syntax of Python or R.

The final reason is that there probably are .NET-based development teams out there that want to start using a few bits and pieces of ML, but don't want to invest training time in learning a new language. Hopefully these articles will help those teams out with a few hints as to how to get up and running.

One last note - I'm no expert on this subject. I'm an enthusiastic, coffee-fueled amateur when it comes to data science. If anyone reading this realises I've made a mistake, or missed out on a better way to do things, then I'd love to hear from you. We're hopefully all in this to keep learning and improving. I certainly am. I'm always grateful to anyone that can help me improve my knowledge of a subject.

Bayes and his Algorithm

As well as everything else, I'm a bit of a history nerd, so I wanted to start by talking a little about the background to the Bayes algorithm. It's named after Thomas Bayes, an English mathematician from the 1700s. So far as I'm aware, we don't know a great deal about his life - there's no certainty that we even have a picture of any kind to show us what he looked like.

Like a lot of educated men of that era, he was a member of the clergy. He published a few papers offering opinions on mathematical topics, but that was about it during his lifetime. His greatest contribution to the field was a set of notes he never intended for publication, that were nevertheless released to the public after his death. It's from these that we get the famous Bayes Algorithm.

What is it?

Like a lot of Machine Learning algorithms, the Naive Bayes Algorithm relates to probability. If you want a C# analogy, it's a way of doing a GroupBy over a set of probabilities, pivoting them around one of their properties. Consider this example:

For some reason, we've got a set of 13 bags of Doctor Who stories, one for each Doctor (yes, I know it's a little more complicated than 13, bear with me) and we want to know for any given bag, what the chances are of it being a Dalek story. It'd look something like this:

  • 1st Doctor (Hartnell) - 5/29 = 0.172 = 17.2%
  • 2nd Doctor (Troughton) - 2/21 = 0.095 = 9.5%
  • 3rd Doctor (Pertwee) - 4/24 = 0.167 = 16.7%
  • 4th Doctor (T. Baker) - 2/41 = 0.049 = 4.9%
  • 5th Doctor (Davison) - 1/20 = 0.05 = 5.0%
  • 6th Doctor (C. Baker) - 1/8 = 0.125 = 12.5%
  • 7th Doctor (McCoy) - 1/12 = 0.083 = 8.3%
  • 8th Doctor (McGann) - 0/2 = 0 = 0%
  • 9th Doctor (Eccleston) - 2/10 = 0.2 = 20%
  • 10th Doctor (Tennant) - 3/36 = 0.083 = 8.3%
  • 11th Doctor (Smith) - 3/39 = 0.077 = 7.7%
  • 12th Doctor (Capaldi) - 1/35 = 0.029 = 2.9%
  • 13th Doctor (Whittaker) - 5/26 = 0.192 = 19.2%

Before there are any arguments, I'm only counting proper Dalek stories, where they're the main villains. I'm not counting every episode that features one wandering around in the background somewhere. Probably. Like everything else in Doctor Who it's probably up for debate. Even the number of episodes is a little arbitrary here.

So what we have so far is a list of probabilities that say "given this is a (for example) First Doctor story, the chances it is a Dalek story is X". Using the Bayes algorithm, we can turn that on its head, and say "Given this is a Dalek story, the chances it is a First Doctor story is Y".

In case you're curious, the actual formula looks like this: P(c|x) = (P(x|c) * P(c)) / P(x). That's a maths thing, though. I'm not interested in precise mathematical definitions; I want the practical, engineer's version, and that looks more like this:

  1. Select a Probability of "Daleks Given 1st Doctor" (i.e. 5/29 = 0.172)
  2. Multiply by the probability of any random story being a 1st Doctor story (i.e. 29/303 = 0.096 = 9.6%)
  3. Divide by the probability of any random story being a Dalek story (i.e. 30/303 = 0.099 = 9.9%)

Following our example, that would mean that the probability a random Dalek story is a First Doctor story is 0.172 * 0.096 / 0.099 = 0.167, or 16.7%. As a sanity check: 5 of the 30 Dalek stories in the set are First Doctor stories, and 5/30 = 0.167, so the numbers agree.

If you want to see some C# code to work out the results for all of the Doctors, it looks like this:

// Each tuple holds (Doctor number, total stories, Dalek stories)
var DoctorDalekData = new[]
{
	(Doctor: 1, Stories: 29, DalekStories: 5),
	(Doctor: 2, Stories: 21, DalekStories: 2),
	(Doctor: 3, Stories: 24, DalekStories: 4),
	(Doctor: 4, Stories: 41, DalekStories: 2),
	(Doctor: 5, Stories: 20, DalekStories: 1),
	(Doctor: 6, Stories: 8, DalekStories: 1),
	(Doctor: 7, Stories: 12, DalekStories: 1),
	(Doctor: 8, Stories: 2, DalekStories: 0),
	(Doctor: 9, Stories: 10, DalekStories: 2),
	(Doctor: 10, Stories: 36, DalekStories: 3),
	(Doctor: 11, Stories: 39, DalekStories: 3),
	(Doctor: 12, Stories: 35, DalekStories: 1),
	(Doctor: 13, Stories: 26, DalekStories: 5)
};

// P(Daleks | Doctor) for each Doctor
var probabilityOfDalekGivenDoctor = DoctorDalekData.Select(x =>
	(x.Doctor, Probability: x.DalekStories / (decimal)x.Stories)
);

var totalStories = DoctorDalekData.Sum(x => x.Stories);
var totalDalekStories = DoctorDalekData.Sum(x => x.DalekStories);

// P(Doctor) - the chance a random story belongs to each Doctor
var probabilityOfDoctor =
	DoctorDalekData.ToDictionary(x => x.Doctor, x => (decimal)x.Stories / totalStories);

// P(Daleks) - the chance a random story is a Dalek story
var probabilityOfDaleks = totalDalekStories / (decimal)totalStories;

// Bayes: P(Doctor | Daleks) = P(Daleks | Doctor) * P(Doctor) / P(Daleks)
var probabilityOfDoctorGivenDaleks = probabilityOfDalekGivenDoctor
	.Select(x => (
		x.Doctor,
		Probability: x.Probability * probabilityOfDoctor[x.Doctor] / probabilityOfDaleks)
	);

var reportLines = probabilityOfDoctorGivenDaleks.Select(x =>
	$"{x.Doctor}, {Math.Round(x.Probability, 2)}"
);

var reportHeader = "Doctor, Probability";

var report = reportHeader + Environment.NewLine + string.Join(Environment.NewLine, reportLines);

The result of running this code looks like this:

Doctor, Probability
1, 0.17
2, 0.07
3, 0.13
4, 0.07
5, 0.03
6, 0.03
7, 0.03
8, 0
9, 0.07
10, 0.1
11, 0.1
12, 0.03
13, 0.17

Looks like if I were a betting man, and we were inexplicably betting on which Doctor we'd get from a random bag of all of the Dalek stories on DVD, I'd put my money on the 1st or 13th Doctors.
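Incidentally, there's a quick sanity check available here. Because every Dalek story belongs to exactly one Doctor, the inverted probabilities have to sum to 1. Better still, the story totals in the formula cancel out, so for a single property like this, P(Doctor|Daleks) is simply each Doctor's share of the 30 Dalek stories (which is why 5/30 = 0.17 shows up for both the 1st and 13th Doctors). A minimal sketch, reusing the Dalek story counts from above:

```csharp
using System;
using System.Linq;

// Dalek story counts per Doctor, same figures as the worked example above.
var dalekStories = new[] { 5, 2, 4, 2, 1, 1, 1, 0, 2, 3, 3, 1, 5 };
var totalDalekStories = dalekStories.Sum(); // 30

// With a single property, the Bayes inversion collapses to simple counting:
// P(Doctor | Daleks) is just each Doctor's share of the 30 Dalek stories.
var inverted = dalekStories.Select(x => x / (decimal)totalDalekStories).ToArray();

// Sanity check: the Doctors partition the Dalek stories, so the
// inverted probabilities must sum to 1 (give or take decimal rounding).
Console.WriteLine(inverted.Sum());
```

If the probabilities from a calculation like this don't sum to (almost exactly) 1, something has gone wrong somewhere.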

What's it For?

This was all good fun, but what's the point? For one thing, we can now take an array of data items with a list of properties, and quickly and easily calculate a set of probabilities pivoted around one of those properties. In our example, we went from a set of data about each Doctor to a set of data about all of the Dalek stories in the set, and how they break down by Doctor, in much the same way that a C# GroupBy can be used to repivot a dataset around a chosen property. Creating pie charts or other breakdown reports is one genuinely useful application of this algorithm.
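To make that GroupBy analogy concrete, here's a hypothetical sketch. Given raw per-story records (the data below is made up purely for illustration), grouping the Dalek stories by Doctor produces the same kind of per-Doctor shares that the Bayes inversion gave us from the summary counts:

```csharp
using System;
using System.Linq;

// Hypothetical raw records: one entry per story (made-up sample data).
var stories = new[]
{
	(Doctor: 1, IsDalekStory: true),
	(Doctor: 1, IsDalekStory: false),
	(Doctor: 9, IsDalekStory: true),
	(Doctor: 9, IsDalekStory: true),
	(Doctor: 13, IsDalekStory: false)
};

var totalDalekStories = stories.Count(s => s.IsDalekStory);

// Repivot around the Dalek property: of all the Dalek stories,
// how do they break down by Doctor?
var shareByDoctor = stories
	.Where(s => s.IsDalekStory)
	.GroupBy(s => s.Doctor)
	.Select(g => (Doctor: g.Key, Share: g.Count() / (decimal)totalDalekStories));

foreach (var x in shareByDoctor)
	Console.WriteLine($"{x.Doctor}: {x.Share}");
```

The difference is that Bayes lets you do the same repivot when all you have is the summary statistics (the per-Doctor probabilities), rather than the raw records.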

There's another use: we can also use this to classify things. In Part 2, I'll show you how to use a Naive Bayes classifier to identify an author by their writing style, among other things.

Until next time...
