This project has three main sections: the customer segmentation report, the supervised learning model, and the Kaggle competition.

For the customer segmentation report, demographics data for customers of Arvato, a company in Germany, is compared with data for the general population to create customer segments.

Then, using the previous analysis, a supervised learning model is built to determine which individuals will respond to Arvato’s marketing campaign.

The model is then used to make predictions on the test dataset as part of a Kaggle Competition.

The Data

  • Azdias: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).

You have been given some text files to go through. What do you do?

If there were 10, 20 or even 30 of them, you could read through them all in, say, no more than a week. But what happens when the documents number in the hundreds or thousands?

Well, definitely not by hand.

In response to the COVID-19 pandemic, the COVID-19 Open Research Dataset Challenge (CORD-19) was announced on Kaggle. The dataset contains a corpus of 200,000 scholarly articles about COVID-19, SARS-CoV-2, and other related coronaviruses, far more than researchers can sift through by hand.

Source: Unicode website

Have you ever tried to use pandas to open a CSV file, only to get an error?
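The error is not named here, so the following is only a minimal sketch under one assumption: that it is a UnicodeDecodeError raised because the file was not saved as UTF-8. The sample bytes, the helper name `first_working_encoding`, and the list of candidate encodings are all hypothetical and chosen for illustration. The sketch uses only the standard library to show the underlying idea of trying encodings until one decodes the bytes.

```python
import csv
import io

# Hypothetical sample: a tiny CSV saved as Latin-1 rather than UTF-8.
RAW = "name,city\nJörg,Köln\n".encode("latin-1")


def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    The candidate list is an assumption, not an exhaustive one. Note that
    latin-1 accepts any byte sequence, so it acts as a last resort and can
    silently produce wrong characters.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return enc
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 fails on these bytes, so the helper falls through to the next candidate.
encoding = first_working_encoding(RAW)

# Decode with the chosen encoding and parse the text as CSV.
rows = list(csv.reader(io.StringIO(RAW.decode(encoding))))
```

Once a working encoding is known, the same name can be passed to `pandas.read_csv` through its `encoding` parameter. A successful decode does not prove the encoding is the right one, so it is worth checking a few rows by eye.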

