The problem is that with label encoding
Posted: Wed Dec 18, 2024 4:10 am
What's the difference? Our categories used to be values in the rows of a single column, but now each category is its own column. Our numeric variable, calories, has stayed the same. A 1 in a particular column tells the computer the correct category for that row's data, and every other category column in that row holds a 0. In other words, we've created an additional binary column for each category.
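To make that concrete, here is a minimal sketch using pandas. The food names and calorie values are made up for illustration, and pd.get_dummies is just one convenient way to do this; it isn't necessarily the approach used in the rest of this post.

```python
import pandas as pd

# Hypothetical data: the food names and calorie values are invented.
df = pd.DataFrame({
    "food": ["apple", "chicken", "broccoli"],
    "calories": [95, 231, 50],
})

# Create one binary column per category in "food".
# The numeric "calories" column passes through unchanged.
encoded = pd.get_dummies(df, columns=["food"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 across the food_* columns, which marks that row's category.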
It's not immediately obvious why this is better (other than the problem I mentioned above), and that's because it isn't better across the board. Like many things in machine learning, we won't use it in every situation, and it isn't a strict upgrade over label encoding. It simply fixes a problem you'll encounter with label encoding when working with categorical data.
One hot encoding in Code (Get it? It's a play on words)
It's always helpful to see how this is done in code, so let's do an example. Normally, I'm a firm believer that we should do something without libraries to learn it, but for this tedious preprocessing step we don't really need to. Libraries can make this so simple. We're going to use numpy, sklearn, and pandas, as you'll find yourself using those three libraries in a lot of your projects.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
Now that we have the tools, let's get started. We'll be working with a made-up dataset. Load the dataset with the pandas .read_csv function.
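The original dataset and file name aren't shown here, so the following is only a sketch under assumptions: the CSV contents are invented, and an in-memory string stands in for a file on disk so the example runs on its own. It uses the LabelEncoder and OneHotEncoder classes imported earlier.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Stand-in for a CSV file on disk; these rows are invented for illustration.
csv_text = """food,calories
apple,95
chicken,231
broccoli,50
apple,80
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Label encoding: each category becomes an integer
# (classes are assigned in sorted order).
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(dataset["food"])

# One-hot encoding: each category becomes its own binary column.
# The encoder returns a sparse matrix by default, so convert it to a dense array.
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(dataset[["food"]]).toarray()

# Put the untouched numeric column back next to the binary columns.
features = np.column_stack([onehot, dataset["calories"].to_numpy()])

print(label_encoder.classes_)
print(labels)
print(features)
```

Comparing labels with features shows the trade-off: label encoding gives one integer column, while one-hot encoding gives one column per category with no implied order between them.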