Welcome!

About Me

I’m Sherry. I am a Data Scientist with a background in business operations in the consumer goods industry.
I have an insatiable sense of curiosity and I use Data Science to answer my many questions: Did the opening of the Chase Center bring more crime and mayhem to my beloved Dogpatch neighborhood in San Francisco? Let’s find out with some hypothesis testing!
I’ve always enjoyed streamlining processes, and I have experimented with machine learning to do so: I created an Anime Recommender System to make finding my next anime faster. The application received over 4,500 impressions within one week of being posted on LinkedIn. Check out the Web App to find your next anime!

Projects

What Else Did The Chase Stadium Bring to SF?

An exploratory data analysis comparing crime incidents and fire department service call volume in the Dogpatch and Mission Bay neighborhoods of San Francisco. Used Welch’s t-test and the Mann-Whitney U test to test for statistically significant differences in crime and fire call volumes between dates with events at the Chase Center and dates without.
Technologies Used: NumPy, pandas, SciPy, Matplotlib
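
For readers unfamiliar with the two tests named above, here is a minimal sketch of the statistics they compute, written with the Python standard library only. The daily counts are invented for illustration and are not the project's data, and the helper names `welch_t` and `mann_whitney_u` are hypothetical, not taken from the project repository.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # sample variance of a divided by its size
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mann_whitney_u(a, b):
    """Return the U statistic for sample a; ties count as half a win."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


# Hypothetical daily incident counts, invented for illustration only.
event_days = [14, 18, 16, 20, 17]
non_event_days = [12, 15, 11, 14, 13]

t_stat, dof = welch_t(event_days, non_event_days)
u_stat = mann_whitney_u(event_days, non_event_days)
print(f"Welch t = {t_stat:.3f}, df = {dof:.1f}, U = {u_stat}")
```

Welch's test does not assume equal variances in the two groups, and the Mann-Whitney U test does not assume normality, which is presumably why both were used. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.mannwhitneyu(a, b)` compute these statistics along with p-values.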

Github Repo
PPT

How Are Airbnb Listing Prices Determined?

Analyzed current Airbnb daily listing prices in San Francisco to predict prices of future listings using Linear Regression, Random Forest, and Gradient Boosting. Performed feature engineering and hyperparameter tuning; the cross-validated Random Forest model reduced RMSE from a baseline of $207 to $54, an improvement of roughly 74%.
Technologies Used: NumPy, pandas, Matplotlib, Seaborn, scikit-learn, NLTK
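
The relative improvement follows directly from the two RMSE figures quoted above. A minimal sketch of that arithmetic is shown here; the helper names `rmse` and `relative_improvement` are hypothetical, and the toy prices are invented rather than taken from the Airbnb data.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline, final):
    """Fraction by which the final error is lower than the baseline error."""
    return (baseline - final) / baseline


# Toy example with invented prices: both predictions are off by $10.
toy_rmse = rmse([100, 200], [110, 190])

# RMSE figures quoted in the project description, in dollars.
improvement = relative_improvement(207, 54)
print(f"{improvement:.1%}")  # prints 73.9%
```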

Github Repo
PPT
Video Presentation

Your Anime Match Maker: An Anime Recommender System

Created an anime recommender system using MyAnimeList’s dataset of anime released prior to 2018. The recommender uses three techniques: popularity-based recommendations, content-based similarity between anime (calculated with cosine similarity), and matrix factorization collaborative filtering with Spark’s ALS model.
Technologies Used: NumPy, pandas, Matplotlib, scikit-learn, Spark ML, AWS SageMaker
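
To illustrate the content-based piece, here is a minimal sketch of ranking titles by cosine similarity. It assumes each anime is represented as a genre vector; the titles, the vectors, and the helper `most_similar` are all hypothetical and are not the features or code used in the project.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical genre vectors: [action, comedy, romance, sci-fi].
catalog = {
    "Title A": [1, 0, 0, 1],
    "Title B": [1, 0, 0, 0],
    "Title C": [0, 1, 1, 0],
}


def most_similar(title, catalog, k=2):
    """Return the k other titles most similar to `title`, best match first."""
    target = catalog[title]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in catalog.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


recommendations = most_similar("Title A", catalog)
print(recommendations)
```

Here "Title B" ranks first because it shares the action genre with "Title A", while "Title C" shares no genres and scores 0.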

Check out the Web App! (It takes a second to load)

Github Repo
PPT
Video Presentation

Case Studies

Classifying Fraud

Maximized profit for a ticket sales and distribution company by identifying fraudulent users and posts using Logistic Regression, Random Forest, Gradient Boosting, MLP Classifier, and XGBoost, eventually achieving an AUC of 99.4%.
Technologies Used: NumPy, pandas, Matplotlib, Seaborn, scikit-learn, GitHub
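
For context on the metric, AUC can be read as the probability that a randomly chosen fraudulent case receives a higher score than a randomly chosen legitimate one. A minimal sketch of that definition follows; the labels and scores are invented, and the helper `roc_auc` is hypothetical rather than the evaluation code used in the case study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = fraud, 0 = legitimate, with hypothetical model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.8]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

This pairwise form is quadratic in the number of samples, so it is only suitable for small illustrations; scikit-learn's `roc_auc_score` is the usual choice on real data.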

Sentiment Classification

Classified the sentiment of movie reviews using text classification techniques (stopword removal, stemming, clustering, and topic modeling) and machine learning models (Naive Bayes, Logistic Regression, Random Forest, and Gradient Boosting), eventually achieving an accuracy of 88%.
Technologies Used: NumPy, pandas, Matplotlib, Seaborn, scikit-learn, NLTK, GitHub

Preventing Churn

Used classification to identify customers likely to churn. Utilized Logistic Regression, Random Forest, and Gradient Boosting, eventually achieving an accuracy of 78%.
Technologies Used: NumPy, pandas, Matplotlib, scikit-learn, GitHub

Sherry Duong