Math For Data Science: Confusion Matrix & Sets


This content originally appeared on DEV Community and was authored by Manar Abdelkarim

Hello Scientists

Recently I was applying for data science positions. In many interviews and technical exams, you will be asked about math basics that are important for any data scientist. So today let's study math! We will cover what a set is, then cardinality, intersection, and union, and then we will talk about the confusion matrix.


Let's get started:

What is a Set

A set is a collection of things, and the things it is made up of are called elements.

A set can be a group of similar or different things. For example, you can have a set of books or a set of random things in a box.

Let's look at some symbols that are commonly used with sets:

{}   -> The representation of set
∈   -> Membership / element of a set
∉   -> Not an element of a set / no membership
∅   -> Empty set (Phi)
∪   -> Union
∩   -> Intersection

Now let's take some examples to understand sets and their symbols:

Suppose we have a set of students in a math course, and their names are Manar, Noor, Raneem, and Noura.

To represent that set we write:

math_students = {Manar, Noor, Raneem, Noura}

Set representation in Python :

math_students = {'Manar', 'Noor', 'Raneem', 'Noura'}

Note that a set in Python can hold items of different types, such as:

my_set = {'A',4,True,5.78}

Set items in Python are unordered, unchangeable (you can add or remove items, but you cannot modify an item in place), and do not allow duplicate values.
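Here is a small sketch of what "no duplicates" and "unordered" look like in practice. The names `letters` and `same_letters` are made up for this illustration:

```python
# Duplicates are dropped when the set is built.
letters = {'a', 'b', 'a', 'c', 'b'}
print(len(letters))  # 3

# Order does not matter: two sets with the same elements are equal.
same_letters = {'c', 'b', 'a'}
print(letters == same_letters)  # True

# Items can be added or removed, but not changed in place.
letters.add('d')
letters.discard('a')
print(sorted(letters))  # ['b', 'c', 'd']
```

`sorted()` is used only to get a predictable printing order, since a set itself has none.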

How many students applied for the math course above?

The answer is 4. To represent it mathematically we use the word "cardinality", which just means the size of the set. So the cardinality of math_students is 4, and this is how we represent it:

|math_students| = 4

To find the cardinality in Python we write:

cardinality = len(math_students)

Note that the name cardinality here is just an example of a variable name

Now what is the relationship between Noura and math_students?

Noura is a member of math_students. We represent this using ∈, like this:
Noura ∈ math_students

To check if there is a membership relationship in Python :

membership = "Noura" in math_students

Note that the name membership here is just an example of a variable name

What is the relationship between Dario and math_students ?

Dario is not a member of the math students set. That's how we represent it:

Dario ∉ math_students

To check that there is no membership relationship in Python:

no_membership = "Dario" not in math_students

Note that the name no_membership here is just an example of a variable name

Now we have another set, failed_students, for the students who failed the exam. Fortunately, no student has failed, so our new set is empty. To represent our empty set we write:

failed_students = ∅

In Python, the empty set is written as set(), because empty braces {} create an empty dictionary instead:

failed_students = set()

And its cardinality is equal to zero:

print(len(failed_students))

output : 0
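One pitfall is worth checking here: in Python, empty braces create an empty dictionary, while `set()` creates an empty set. A quick check, with illustrative variable names:

```python
empty_braces = {}
empty_set = set()

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set
print(len(empty_set))               # 0
```

Both have length 0, so `len()` alone will not reveal the difference, but only the real set supports operations such as `&` and `|`.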

Now we have reached the fun part (intersection and union).

We will start with intersection.

Back to our math class example: let's assume that we have another course, Python, and the students who applied for it are Manar, Noor, Dario, Raneem, and Aseel. The math roster below is also slightly different from the earlier one: Tala has taken Noor's place.
So now we have:

math_students = {Manar, Tala, Raneem, Noura}

Python_students = {Manar, Noor, Dario, Raneem, Aseel}

If you check the students' names (assuming that the same name means the same person), you will find that Manar and Raneem are attending both the math and Python classes.

So we say that the intersection of math_students and Python_students is Manar and Raneem. To represent it mathematically:

math_students ∩ Python_students = {Manar, Raneem}

Note that the result of the intersection is a set

In Python, we represent the intersection using the & symbol:

math_students & Python_students

output: {'Manar','Raneem'}

Let's imagine that we have another course, Geography, and the students are Mark, Kitty, Tala, and Keven:

geography_students = {Mark, Kitty, Tala, Keven}

What is the intersection of geography_students and Python_students?
Well, the answer is: there is no intersection, because no student is in both sets.
And what did we say about how to represent an empty set? We use Phi ∅. So:

geography_students ∩ Python_students = ∅

And in Python:

geography_students & Python_students

output: set()
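As a runnable sketch of the two intersection examples, the snippet below builds the three rosters as Python sets and also uses the built-in `intersection()` and `isdisjoint()` set methods, which the text above does not mention:

```python
math_students = {'Manar', 'Tala', 'Raneem', 'Noura'}
Python_students = {'Manar', 'Noor', 'Dario', 'Raneem', 'Aseel'}
geography_students = {'Mark', 'Kitty', 'Tala', 'Keven'}

# The & operator and the intersection() method give the same result.
both = math_students.intersection(Python_students)
print(both == (math_students & Python_students))  # True
print(sorted(both))  # ['Manar', 'Raneem']

# isdisjoint() is True when two sets share no elements.
print(geography_students.isdisjoint(Python_students))  # True
```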

Now the Union
We have :

math_students = {Manar, Tala, Raneem, Noura}

Python_students = {Manar, Noor, Dario, Raneem, Aseel}

The union is all the members of all the sets,
which here means all the members of math_students and all the members of Python_students.

To represent the union we said that we will use ∪.
So, let's try to put them together:
math_students ∪ Python_students = {Manar, Tala, Raneem, Noura, Manar, Noor, Dario, Raneem, Aseel}

That seems good, but notice that Manar and Raneem are duplicated. Why? Because Manar and Raneem are attending both the math and Python courses. In other words, Manar and Raneem are the intersection of math_students and Python_students. A set does not allow duplicates, so let's remove one Manar and one Raneem:

math_students ∪ Python_students = {Manar, Tala, Raneem, Noura, Noor, Dario, Aseel}

Now our union is correct. What we just did is called the inclusion-exclusion principle.
For the sizes of two sets A and B, the formula is:

|A ∪ B| = |A| + |B| - |A ∩ B|

And that's what we did: we added the members of math_students and Python_students (4 + 5 = 9), then subtracted the intersection to remove the duplicates (9 - 2 = 7).

To represent union in Python:

Python_students | math_students

output: {'Manar', 'Tala', 'Raneem', 'Noura', 'Noor', 'Dario', 'Aseel'}
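The inclusion-exclusion formula can be checked directly on the two rosters. This is a minimal sketch; `lhs` and `rhs` are just illustrative names for the two sides of the formula:

```python
math_students = {'Manar', 'Tala', 'Raneem', 'Noura'}
Python_students = {'Manar', 'Noor', 'Dario', 'Raneem', 'Aseel'}

union = math_students | Python_students
intersection = math_students & Python_students

# |A ∪ B| = |A| + |B| - |A ∩ B|  ->  7 = 4 + 5 - 2
lhs = len(union)
rhs = len(math_students) + len(Python_students) - len(intersection)
print(lhs, rhs)  # 7 7
```

Note that Python may print the members of a set in any order, so the printed union can look different from the output shown above while still being the same set.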

And that's it for the symbols!

Let's look at how sets are used in real life. Our first example is detecting Corona in people.

  • Let's call all the people that we chose as a test sample -> X
  • The set of people in X who are sick (have Corona) -> S
  • The set of people in X who are healthy (don't have Corona) -> H

To represent S in mathematics we write :

S={x ∈ X : x has Corona}

The lowercase x stands for one element of the set X.
So the formula reads: S is the set of every x (a person from the test sample) that is a member of X (the set of people in the test sample) such that x (the person) has Corona.

And to represent H in mathematics we write :

H={x ∈ X : x does not have Corona}

Now, is it possible that one of the people in X has Corona and does not have Corona at the same time?
Of course, this is impossible.

So we know that S ∩ H = ∅ and S ∪ H = X (the people in the test sample are either sick or healthy, so if we put all the sick and healthy people together, we get our whole sample).
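The set-builder definitions of S and H map closely onto Python set comprehensions. The sketch below uses an invented sample (the `has_corona` dictionary and every name in it are hypothetical) to check that the sick and healthy sets do not overlap and together cover the sample:

```python
# Hypothetical sample: person -> True if they actually have Corona.
has_corona = {'Sara': True, 'Omar': False, 'Lina': True,
              'Adam': False, 'Hana': False}

X = set(has_corona)                      # everyone in the test sample
S = {x for x in X if has_corona[x]}      # S = {x ∈ X : x has Corona}
H = {x for x in X if not has_corona[x]}  # H = {x ∈ X : x does not have Corona}

print(S & H == set())              # True: nobody is both sick and healthy
print(S | H == X)                  # True: together they cover the sample
print(len(S) + len(H) == len(X))   # True
```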

Now let's add our Corona test results, because we want to know how accurate the PCR test's predictions are.

  • Let the people from the sample who tested positive -> P. We represent them as P = {x ∈ X : x tested positive for Corona}

Note that positive here means that the PCR test predicts that someone has Corona.

  • The people from the sample who tested negative -> N. We represent them as N = {x ∈ X : x tested negative for Corona}

Now we have 4 possible outcomes:

  1. someone who has Corona and whose test for Corona is positive.
  2. someone who does not have Corona and whose test for Corona is negative.
  3. someone who has Corona but whose test for Corona is negative.
  4. someone who does not have Corona but whose test for Corona is positive.

These four outcomes make up the confusion matrix.

Confusion matrix:

A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.

Now let's discuss the 4 possibilities:

Someone who has Corona and whose test for Corona is positive.

The representation of this possibility is: S ∩ P
because S is the set of people who have Corona and P is the set of people who tested positive.
Someone who has Corona and tested positive is someone in both sets (S and P), which means the intersection.

Suppose we have a girl called Sara. Sara is in the P set, which means the PCR test predicted that Sara is in the S set (the set of sick people).
Because Sara tested positive, we call this prediction "positive", and if she really is sick (she is in the S set), then we say the prediction is True -> True Positive.

  • When the test predicts the thing we are trying to find (here, Corona), we call the prediction positive.
  • When reality matches the prediction, we call it True.

So S ∩ P = True Positive.

Someone who does not have Corona and whose test for Corona is negative.

The representation of this possibility is: H ∩ N
because H is the set of healthy people who don't have Corona and N is the set of people who tested negative.
Someone who doesn't have Corona and tested negative is someone in both sets (H and N), which means the intersection.

Let's assume that Sara is in the N set, which means the PCR test predicted that Sara is in the H set (the set of healthy people).
Because Sara tested negative, we call this prediction "negative", and if she really is healthy (she is in the H set), then we say the prediction is True -> True Negative.

  • When the test predicts the opposite of what we are trying to find (here, no Corona), we call the prediction negative.
  • When reality matches the prediction, we call it True.

So H ∩ N = True Negative.

Someone who has Corona but whose test for Corona is negative.

The representation of this possibility is: S ∩ N
because S is the set of sick people who do have Corona and N is the set of people who tested negative.

Let's assume that Sara is in the N set, which means the PCR test predicted that Sara is in the H set (the set of healthy people).
Because Sara tested negative, we call this prediction "negative", and because she really is sick and not healthy (she is in the S set), we say the prediction is False -> False Negative.

  • When the test predicts the opposite of what we are trying to find (here, no Corona), we call the prediction negative.
  • When reality does not match the prediction, we call it False.

So S ∩ N = False Negative.

Someone who does not have Corona but whose test for Corona is positive.

The representation of this possibility is: H ∩ P
because H is the set of healthy people who don't have Corona and P is the set of people who tested positive.

Again, poor Sara is in the P set, which means the PCR test predicted that Sara is in the S set (the set of sick people).
Because Sara tested positive, we call this prediction "positive", and because she really is healthy and not sick (she is in the H set), we say the prediction is False -> False Positive.

  • When the test predicts the thing we are trying to find (here, Corona), we call the prediction positive.
  • When reality does not match the prediction, we call it False.

So H ∩ P = False Positive.

Note that to make our prediction accurate, we want to increase the true positives and true negatives and decrease the false positives and false negatives.
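All four cells can be computed with plain set intersections. The sketch below uses an invented six-person sample (none of the names or results come from real data) and also computes accuracy, taken here to be the share of the sample where the test matched reality, which is a standard definition rather than something stated above:

```python
# Hypothetical sample; the names and results are made up.
X = {'Sara', 'Omar', 'Lina', 'Adam', 'Hana', 'Ziad'}
S = {'Sara', 'Lina', 'Ziad'}   # actually sick
P = {'Sara', 'Lina', 'Adam'}   # tested positive
H = X - S                      # actually healthy
N = X - P                      # tested negative

true_positive = S & P    # sick and tested positive
true_negative = H & N    # healthy and tested negative
false_negative = S & N   # sick but tested negative
false_positive = H & P   # healthy but tested positive

# Accuracy: (TP + TN) divided by the size of the whole sample.
accuracy = (len(true_positive) + len(true_negative)) / len(X)

print(len(true_positive), len(true_negative),
      len(false_negative), len(false_positive))  # 2 2 1 1
print(round(accuracy, 2))  # 0.67
```

Every person lands in exactly one of the four cells, so the four counts always add up to the size of X.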

[Figure: confusion matrix]

I know that it is a bit of a headache, so here is another example you can think of:

We are making an app with machine learning algorithms that predict whether a word is a bad (nasty) word.

Again, each word is either a bad word or not a bad word.

Because we are searching for bad words, if we predict that a word is bad we call the prediction positive, and if we predict the opposite ("not bad") we call the prediction negative.

If our prediction matches reality, we call the prediction True, and if it does not, we call the prediction False.

So:

  • Our sample X is the words in a sentence.
  • G is the set of words that are not bad. -> G = {x ∈ X : x is not a bad word}.
  • B is the set of bad words. -> B = {x ∈ X : x is a bad word}.
  • P is the set of words predicted to be bad. -> P = {x ∈ X : x is positive for bad words}.
  • N is the set of words predicted to be not bad. -> N = {x ∈ X : x is negative for bad words}.

Then:

  • B ∩ P = True Positive
  • B ∩ N = False Negative
  • G ∩ P = False Positive
  • G ∩ N = True Negative
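The same four intersections can be wrapped in a small helper and laid out as a 2x2 table. This is only a sketch: `confusion_counts` is a hypothetical function name, and the words and predictions are invented for illustration:

```python
def confusion_counts(sample, actual_positive, predicted_positive):
    """Return (TP, FN, FP, TN) counts using only set operations."""
    actual_negative = sample - actual_positive
    predicted_negative = sample - predicted_positive
    return (
        len(actual_positive & predicted_positive),   # TP
        len(actual_positive & predicted_negative),   # FN
        len(actual_negative & predicted_positive),   # FP
        len(actual_negative & predicted_negative),   # TN
    )

# Made-up words and predictions.
X = {'nice', 'darn', 'hello', 'heck', 'sunny'}
B = {'darn', 'heck'}    # words that really are bad
P = {'darn', 'sunny'}   # words the model flagged as bad

tp, fn, fp, tn = confusion_counts(X, B, P)
print("          predicted bad   predicted not bad")
print(f"bad       {tp:^13}   {fn:^17}")
print(f"not bad   {fp:^13}   {tn:^17}")
```

Rows are the reality (B and G) and columns are the prediction (P and N), so the diagonal holds the correct predictions.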

Bonus info: What is a Venn diagram?

Venn diagram:

An illustration that uses circles to show the relationships among a finite number of groups of things.

For Example:

A= {1,3,5,7}     B= {2,3,4,5,6}      C= {9,15}

[Figure: Venn diagram of A, B, and C]
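The regions of that Venn diagram can be worked out with set operations. A short sketch using the three example sets, where `-` is Python's set difference operator:

```python
A = {1, 3, 5, 7}
B = {2, 3, 4, 5, 6}
C = {9, 15}

print(sorted(A & B))  # [3, 5]     the overlap of circles A and B
print(sorted(A - B))  # [1, 7]     only in A
print(sorted(B - A))  # [2, 4, 6]  only in B

# C shares nothing with A or B, so its circle stands apart.
print(C.isdisjoint(A | B))  # True
```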

Now let's represent the Corona sets

[Figure: Venn diagram of the Corona sets]

Finally, we have reached the end. I hope you enjoyed it!

Manar Abdelkarim | Sciencx (2021-08-30T20:34:21+00:00) Math For Data Science: Confusion Matrix & Sets. Retrieved from https://www.scien.cx/2021/08/30/math-for-data-science-confusion-matrix-sets/
