Curiously exploring the world with Python and SQL.
Email: sherryduong@gmail.com
LinkedIn: sherry-duong
Github: sherryduong93
View My Resume
I’m Sherry. I am a Data Scientist with a background in business operations in the consumer goods industry.
I have an insatiable sense of curiosity and I use Data Science to answer my many questions: Did the opening of the Chase Center bring more crime and mayhem to my beloved Dogpatch neighborhood in San Francisco? Let’s find out with some hypothesis testing!
I’ve always enjoyed streamlining processes, and have experimented with machine learning to do so: I created an Anime Recommender System to make finding my next anime faster. The application had over 4500 impressions within one week of posting on LinkedIn. Check out the WebApp to find your next anime!
An exploratory data analysis comparing crime incidents and fire department service call volume in the Dogpatch and Mission Bay neighborhoods in San Francisco. Used Welch’s t-test and the Mann-Whitney U test to check whether crime and fire call volumes differ significantly between dates with events at the Chase Center and dates without.
Technologies Used: Numpy, Pandas, Scipy, Matplotlib
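For readers unfamiliar with the two tests named above, here is a minimal sketch of how such a comparison might look with SciPy. The daily counts are synthetic, and the variable names and the 0.05 significance level are assumptions for illustration; this is not the project's data or code.

```python
import numpy as np
from scipy import stats

# Synthetic daily incident counts (assumed for illustration only).
rng = np.random.default_rng(seed=0)
event_days = rng.poisson(lam=12, size=40)       # days with an arena event
non_event_days = rng.poisson(lam=10, size=120)  # days without one

# Welch's t-test: compares means without assuming equal variances.
welch = stats.ttest_ind(event_days, non_event_days, equal_var=False)

# Mann-Whitney U test: rank-based, so it does not assume normality.
mwu = stats.mannwhitneyu(event_days, non_event_days, alternative="two-sided")

alpha = 0.05  # assumed significance level
for name, result in [("Welch's t-test", welch), ("Mann-Whitney U", mwu)]:
    verdict = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: statistic={result.statistic:.3f}, "
          f"p={result.pvalue:.4f} -> {verdict}")
```

Running both tests side by side is a common cross-check: Welch's test is sensitive to differences in means, while the Mann-Whitney U test makes no normality assumption, which can matter for skewed count data.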
Analyzed current Airbnb daily listing prices in San Francisco to predict prices of future listings using Linear Regression, Random Forest, and Gradient Boosting. Performed feature engineering and hyperparameter tuning, reducing the baseline RMSE of $207 to a final RMSE of $54 on the cross-validated Random Forest model, an improvement of about 74%.
Technologies Used: Numpy, Pandas, Matplotlib, Seaborn, Sklearn, NLTK
Github Repo
PPT
Video Presentation
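As a quick reference for the metric used in the Airbnb project above, this sketch shows how RMSE and the relative improvement over a baseline can be computed. The helper names and the toy prices are hypothetical; only the two RMSE figures ($207 and $54) come from the project description.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_improvement(baseline, final):
    """Fraction by which `final` error is lower than `baseline` error."""
    return (baseline - final) / baseline

# Toy nightly prices and predictions (made up for illustration).
toy_error = rmse([100, 150, 200], [110, 140, 230])
print(f"toy RMSE: ${toy_error:.2f}")

# Using the rounded RMSE values quoted in the project description.
gain = relative_improvement(207, 54)
print(f"improvement over baseline: {gain:.1%}")
```

Because RMSE is expressed in the same units as the target, a drop from $207 to $54 can be read directly as the typical size of a pricing error in dollars.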
Created an Anime Recommender system using MyAnimeList’s dataset of anime released prior to 2018. The recommender combines popularity-based ranking, content-based similarity between anime (calculated with cosine similarity), and matrix-factorization collaborative filtering with Spark’s ALS model.
Technologies Used: Numpy, Pandas, Matplotlib, Sklearn, Spark ML, AWS Sagemaker
Check out the Web App! (It takes a second to load)
Github Repo
PPT
Video Presentation
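To illustrate the content-based part of the anime recommender above, here is a minimal sketch of ranking titles by cosine similarity with NumPy. The titles, the binary genre features, and the function names are all made up for illustration; the actual project's features and code may differ.

```python
import numpy as np

# Hypothetical titles with made-up binary genre features
# (columns: action, comedy, romance, sci-fi).
titles = ["Title A", "Title B", "Title C", "Title D"]
features = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

def cosine_similarity_matrix(x):
    """Return pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms  # assumes no all-zero rows
    return unit @ unit.T

def recommend(title, k=2):
    """Return the k titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    sims = cosine_similarity_matrix(features)[idx]
    ranked = np.argsort(-sims)  # most similar first
    return [titles[i] for i in ranked if i != idx][:k]

print(recommend("Title A"))
```

Cosine similarity compares the direction of two feature vectors rather than their length, so a title tagged with many genres is not favored simply for having more nonzero features.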
Maximized profit for a ticket sales and distribution company by identifying fraudulent users and posts using Logistic Regression, Random Forest, Gradient Boosting, MLP Classifier, and XGBoost, eventually achieving an AUC of 99.4%.
Technologies Used: Numpy, Pandas, Matplotlib, Seaborn, Sklearn, Github
Classified the sentiment of movie reviews using text classification techniques (stopword removal, stemming, clustering, and topic modeling) and machine learning models (Naive Bayes, Logistic Regression, Random Forest, and Gradient Boosting), eventually achieving an accuracy of 88%.
Technologies Used: Numpy, Pandas, Matplotlib, Seaborn, Sklearn, NLTK, Github
Used classification to identify customers likely to churn. Utilized Logistic Regression, Random Forest, and Gradient Boosting, eventually achieving an accuracy of 78%.
Technologies Used: Numpy, Pandas, Matplotlib, Sklearn, Github