

Programmers, scientists, data engineers, analysts, and machine-learning engineers alike dream of a perfect world where all data arrives at their doorstep in the cleanest form possible. Unfortunately, humans went and developed the phonetic alphabet before they started talking in binary, or “beep boop” speech. As a result, it is incredibly common to come across words (or “strings” in “beep boop” language) rather than numbers when working with datasets, and this is true of even the cleanest datasets available today.

The problem with combining data and strings is that words cannot be analyzed directly by an artificial brain. Computers speak quantitatively, rather than qualitatively. Asking a computer to interpret words, especially sentences with subjective meaning or emotion, is like having the Cookie Monster eat celery. Fortunately, there is a solution to this problem: there are many different ways to approach turning words into numbers for analysis! Though doing so might not allow a computer to analyze certain things about words, it can certainly help with solving common machine-learning problems that you may encounter in the educational grind that is data science. Typically, whenever machine learning is being done with strings, a data scientist will be working with an encoder. Without further ado, let’s look at some encoders!

One-Hot-Encoding, also called One-Hot or Dummy-Encoding, takes a very radical approach to dealing with categorical variables: every category gets its own column, which holds a 1 where a row belongs to that category and a 0 where it does not. If you’re new to machine learning, one trick you should definitely snatch up as soon as possible is the ability to One-Hot-Encode a dataframe.

Typically, I use One-Hot in situations where I have as few categories as possible, and One-Hot really shines in exactly that setting. This is because, first and foremost, One-Hot-Encoded data takes up a lot of memory and disk space compared to the other encoders available. When working with large sets of categorical features, One-Hot can turn a 5-column dataframe into a 50-column dataframe, which is incredibly hard to work with! On top of that, as the number of categories grows, One-Hot’s effectiveness may drop dramatically.
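To make the One-Hot idea a little more concrete, here is a minimal sketch in plain Python. The function name `one_hot_encode` and the toy rows are made up for illustration and are not taken from any particular library. It shows how a single string column turns into one 0/1 column per category, which is also why the column count grows so quickly.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with one 0/1 column per category.

    `rows` is a list of dicts (a stand-in for a dataframe); this is an
    illustrative helper, not a library function.
    """
    # Collect the distinct categories; sorting keeps the column order stable.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Toy data, invented for this example.
rows = [
    {"age": 31, "color": "red"},
    {"age": 45, "color": "green"},
    {"age": 27, "color": "blue"},
    {"age": 52, "color": "red"},
]

encoded = one_hot_encode(rows, "color")
print(encoded[0])
# {'age': 31, 'color_blue': 0, 'color_green': 0, 'color_red': 1}
```

With only three colors, two columns became four. A feature with dozens of categories would add dozens of columns, which is the blow-up in width and memory that makes One-Hot hard to work with. In real projects you would more likely reach for an existing implementation such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than rolling your own.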
