Hi, I'm Min Shi.

Welcome to my GitHub page! I'm a Ph.D. and master's student interested in working with diverse types of data and passionate about drawing insights from data analytics. I am skilled in SQL querying, big data processing, statistical analysis, ML, NLP, DL, time series, and data visualization, and I am actively seeking a 2024 Data Analyst / Data Scientist position. Let's collaborate in a dynamic, innovative environment to apply data-driven insights and make a positive impact. Explore my repositories and projects, and reach out for potential collaborations. Together, we'll harness the power of data!

About

I am advancing my expertise in Business Analytics with a focus on Data Science at UTD. My PhD in Political Science and comprehensive training in Data Analytics enable me to navigate and simplify complex data-driven challenges. I have a proven track record, contributing to publications that apply sophisticated statistical methods to global health and international political economy.

In my recent role as a Data Scientist Student Consultant, I spearheaded the development of an AI-driven chatbot using advanced NLP techniques, which enhanced customer interaction and increased engagement by 15% through improved response efficiency. My work, employing NLP and machine learning via the XGBoost model, significantly advanced the company’s customer service capabilities.

As a Marketing Analyst intern, I used tools like MySQL and Microsoft Visio to streamline data processes and create impactful business analytics reports. I have also led and collaborated on diverse group projects, ranging from designing payroll management systems to handling big geospatial data using Hadoop and Spark.

Eager to bring my analytical acumen and innovative approach to your team, I am prepared to leverage my skills to enhance your company’s performance and facilitate global expansion. I look forward to contributing to your team, pushing the limits of what data can achieve in the tech industry.

  • Programming Languages: Python, R, SQL, SAS, Stata, Tableau
  • Databases: MySQL, PostgreSQL, MongoDB, Amazon RDS
  • Big Data: Hadoop, Sqoop, Hive, Impala, Pig, Spark
  • Automation: Alteryx, Appian, ACCELQ, UiPath
  • Visualization Tools: Tableau, Power BI, Jupyter Notebook, R Shiny
  • Libraries: NumPy, pandas, Matplotlib, scikit-learn, SciPy, statsmodels, NLTK, PyTorch, TensorFlow
  • Certificates: Graduate Certificate in Applied Machine Learning at UTD, Google Data Analytics Certificate, AWS Certified Cloud Practitioner Certificate, Alteryx Designer Core Certificate, Appian Certified Associate Developer, ACCELQ Automation Engineer Certification
  • Languages: English, Chinese, Japanese

Experience

Research Assistant
  • Took responsibility for data manipulation and model building for 10+ global health and policy analytics projects.
  • Directed the data gathering processes, utilizing diverse methods like sampling, surveys, and web scraping.
  • Developed robust statistical models, including multivariable regression, fixed-effect regression, difference-in-differences, and time-series models, to facilitate correlation and causal inference studies.
  • Oversaw a team of over five junior research assistants, ensuring smooth collaboration and timely completion.
  • Tools: Causal Analytics, A/B Testing, Data Collection, Data Analytics, Python, R, Stata, Microsoft Office, Leadership, Communication
May 2020 - Present | Dallas, UTD
Data Science Student Consultant
  • Led a team in creating an AI-driven chatbot to enhance customer engagement for online interactions.
  • Employed NLP and MySQL for analyzing and querying an extensive database containing over 10 million entries.
  • Achieved a 25% improvement in response efficiency and provided 99% accurate predictions using an XGBoost model.
  • Contributed to a 15% rise in user engagement, increasing customer satisfaction and bolstering the company’s image.
  • Tools & Skills: Python, SQL, NLP, ML, UI Design, Leadership, Communication
August 2023 - December 2023 | Dallas, UTD
Marketing Data Analyst
  • Served as a Data Analyst Intern responsible for data management, data visualization, and business analysis.
  • Improved the efficiency of data extraction by 40% through data optimization in MySQL.
  • Employed Microsoft Visio to visualize intricate network structures and aided in product comprehension.
  • Produced Business Intelligence (BI) reports, offering insights based on user structures and competitor analysis.
  • Tools: MySQL, Microsoft Visio, Microsoft Office
July 2017 - August 2017 | Jinan, China

Projects

US Top 4 Airlines Financial Performance Analytics Project

Comprehensive Financial Performance Analytics of the Top Four US Airlines

Accomplishments
  • Tools: Data Mining · Business Model Analytics
  • Analyzed financial data from a 20-year dataset of over 10,000 rows, covering net income, revenue, and expenses across the US airline industry. This deep dive provided insights into long-term financial trends and shifts.
  • Conducted financial performance analytics for the top 4 airlines, identifying key turning points related to major events, alliances, and partnerships over the period.
  • Assessed operational trends and competitive positioning of each airline, deriving specific business model recommendations based on a two-decade comparison with competitors.
Kaggle Plant Pathology Competition

Leveraging Deep Learning CNNs for Disease Diagnosis in Apple Orchards

Accomplishments
  • Tools: Deep Learning · CNN · Transfer Learning
  • Utilized transfer learning on CNNs with 5,590 images in 12 categories, enhancing disease identification accuracy.
  • Conducted image transformation, including rotation, flipping, zooming, and noise injections to augment data.
  • Fine-tuned ConvNeXt CNN models and achieved 86.8% accuracy, securing a Top 3 ranking in the competition.
Python Web Scraping & Natural Language Processing Project

Leveraging Web Scraping for Business Prediction via NLP & ML Approaches

Accomplishments
  • Tools: Web Scraping · Natural Language Processing · Machine Learning Models
  • Created an automated web scraping tool to extract more than 7,000 WSJ news articles using specified keywords.
  • Analyzed WSJ articles using count, TF-IDF, and n-gram count vectorizers.
  • Implemented Naïve Bayes and Random Forest models; achieved a notable ROC AUC value and an increase in S&P 500 stock index prediction accuracy by 12%.
  • Demonstrated consistent accuracy across various vectorizers, suggesting the potential use of NLP in forecasting stock price changes based on WSJ news articles related to U.S. trade.
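
As a rough illustration of the vectorization step named above, the sketch below counts unigrams and bigrams and computes smoothed TF-IDF weights in plain Python. It is not the project's code: the helper names (`ngrams`, `tfidf`) and the two sample headlines are hypothetical, and the weighting shown (raw counts times a smoothed inverse document frequency, with no length normalization) is one common variant chosen for brevity.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercase word n-grams of length 1..n_max."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document.

    weight = term count * (ln((1 + N) / (1 + df)) + 1),
    where N is the number of documents and df the document frequency.
    """
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    # Iterating a Counter yields each distinct term once per document.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
             for t, tf in c.items()}
            for c in counts]


# Hypothetical sample headlines, for illustration only.
docs = ["tariffs raise trade costs", "trade talks resume"]
weights = tfidf(docs)
```

Terms that appear in every document (here, "trade") receive the lowest weight, while terms unique to one document are weighted up. The resulting dictionaries are the kind of feature input a Naïve Bayes or Random Forest classifier would consume.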
Big Data Project

Geospatial Truck Fleet Big Data Analytics and Visualization

Accomplishments
  • Tools: Hadoop ecosystem · Tableau · R
  • Used big data Hadoop ecosystem to process geospatial data ingestion, transformation, and database creation.
  • Performed data exploration and visualization in Tableau by connecting to Hadoop ecosystem server.
  • Modeled how various factors affect truck driver risk, wrote a final report, and proposed ways to lower the probability of large-truck accidents.
Conagra Brands’ Project

Extensive Analysis of Table Spreads Industry

Accomplishments
  • Tools: SAS · Tableau · Statistical Regression Analysis · Time Series Analysis
  • Researched over 1.3 million records to identify key metrics contributing to the sales of top brands.
  • Evaluated strengths and weaknesses of Conagra Brands compared to competitors in each sub-category.
  • Built machine learning and time series models to predict future directions for Conagra Brands.
Database Project

Payroll Management System Database Design via MySQL

Accomplishments
  • Tools: MySQL
  • Led a group of five in conducting business requirements analysis and designing a payroll management database with MySQL consisting of 13 tables.
  • Increased efficiency in extract-transform-load and payroll database management by 100% via stored functions, procedures, and triggers.
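
A trigger of the kind mentioned above can be sketched with Python's built-in `sqlite3` module. SQLite is swapped in for MySQL here so that the example runs without a database server, and because SQLite has no stored procedures, only the trigger is shown. The `employee` and `salary_audit` tables and their columns are hypothetical and are not the project's 13-table schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    salary REAL NOT NULL
);
CREATE TABLE salary_audit (
    employee_id INTEGER,
    old_salary  REAL,
    new_salary  REAL
);
-- Hypothetical trigger: record every salary change automatically.
CREATE TRIGGER log_salary_change
AFTER UPDATE OF salary ON employee
FOR EACH ROW
WHEN OLD.salary <> NEW.salary
BEGIN
    INSERT INTO salary_audit (employee_id, old_salary, new_salary)
    VALUES (OLD.id, OLD.salary, NEW.salary);
END;
""")

conn.execute(
    "INSERT INTO employee (id, name, salary) VALUES (1, 'A. Example', 5000)"
)
conn.execute("UPDATE employee SET salary = 5500 WHERE id = 1")

# One audit row is written by the trigger, not by application code.
audit_rows = conn.execute(
    "SELECT employee_id, old_salary, new_salary FROM salary_audit"
).fetchall()
```

Keeping the audit logic in the database means every client that updates a salary is logged the same way, which is the usual motivation for triggers in a payroll schema.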
Goldman Sachs Project via Alteryx

Goldman Sachs Global Business Analytics and Prediction via Python and Alteryx

Accomplishments
  • Tools: Alteryx · Python · Business Analytics · Time Series Analysis
  • Researched and generated datasets of US and worldwide inflation for the pre-pandemic and post-pandemic periods from raw data via Python.
  • Conducted data cleaning and preprocessing, and built ARIMA and ETS time series models in Alteryx to forecast trends in the Consumer Price Index (CPI) and Producer Price Index (PPI).
  • Presented key trends and findings on inflation and consumer prices and their impact on Goldman Sachs, and provided insights and recommendations on global operation strategies based on the analysis.
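
The simplest member of the ETS family, simple exponential smoothing, can be written in a few lines of Python. This is only a sketch of the idea: the project built its models in Alteryx, and the index values and the smoothing factor `alpha` below are made-up illustration values, not CPI or PPI data.

```python
def simple_exp_smoothing(series, alpha):
    """Return the one-step-ahead forecast after the last observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1},
    initialised with the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Made-up index values, for illustration only.
index_values = [100.0, 102.0, 101.0, 105.0]
forecast = simple_exp_smoothing(index_values, alpha=0.5)
```

A larger `alpha` makes the forecast follow recent observations more closely, while a smaller one smooths out short-term noise. Full ETS models add trend and seasonal components on top of this level equation.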
COVID and Political Economy Project

Analysis of the Effect of COVID-19 on US Trade and US Firms

Accomplishments
  • Tools: Deep Learning · Machine Learning · Statistical Regression Analysis · Communication
  • Synthesized data and created fixed-effect regression models to identify correlations and causal mechanisms
  • Developed and implemented machine learning and deep learning models to conduct counterfactual analysis
  • Presented findings at the 2023 Applied Data Science international conference
US-China Trade War and US Firms Project

Modeling U.S.-China Trade War’s effect on US Multinational Corporations

Accomplishments
  • Tools: Python · R · SQL · Stata · Time Series Analysis
  • Generated and managed a new database using PostgreSQL and performed data analysis in Python
  • Built time series GARCH models in Stata to examine the effects of U.S.-China trade conflicts on US firms
  • Presented the findings at the 2022 International Society for Data Science and Analytics Conference
Tableau Project

COVID-19 Worldwide Cases Synchronous Dashboard using Tableau

Accomplishments
  • Tools: Tableau · Data Visualization
  • Designed a synchronous Tableau dashboard with advanced interactive functions to explore COVID-19 severity.
  • Utilized Tableau to probe the correlation between factors and the severity of COVID-19 by country.
Multinational Corporation Database

U.S. Multinational Corporation Trade Database with China

Accomplishments
  • Tools: SQL · Databases · MySQL
  • The database serves people interested in the US-China trade war, its effect on US-China trade volumes, and its impact on US multinational corporations (MNCs) that depend heavily on global value chains (GVCs).
  • The database provides two main types of data. The first is macro-level data: monthly US-China trade data by commodity, and the volume and percentage of products subject to tariffs between the US and China. It also contains US annual trade with all countries and basic development indicators for those countries, including GDP, population, general tariff rate, and tariff rate for manufactured products. The second is micro-level data on MNCs: the S&P 500 company list with details such as stock symbol, location, sector, and industry; S&P 500 stock price time series; the Fortune 500 company list with annual revenues; Fortune 500 stock price time series; and lists of the top 20 companies by level of sales in China and by share of sales in China.
  • Users can explore how U.S.-China trade relations have changed in the 21st century, how U.S.-China trade relates to tariff changes, how US trade trends differ across companies, and how U.S.-China trade relations affect U.S. MNCs.
Global National Happiness Project

What Factors Affect People’s National Happiness Score?

Accomplishments
  • Tools: R · R Markdown
  • The World Happiness Report is a prominent annual report on countries’ happiness index and has attracted attention from policymakers in multiple areas. Happiness scores are based on respondents’ ratings of their own lives. Each report also includes six basic factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity.
  • This paper explores the effect of other potential factors, including regime type, demographic factors, and COVID-19 severity. The statistical findings indicate that the more democratic a country is and the larger its population, the happier its citizens tend to be. In addition, population density, net population change, and net change in population density are negatively correlated with a country’s happiness score.
  • Data visualization yields two interesting findings. First, the relationship between democracy and a country’s happiness index follows a U shape rather than a positive linear trend, which differs from the statistical results and does not support H1. Second, population size and population density are negatively related to a country’s happiness index, supporting H2a and H2b. The results indicate that statistical regression results are not reliable in all cases and that data visualization is necessary to examine and interpret them more accurately.

Skills

Programming Languages

Python
R
SQL

Tools

SAS
Stata
Tableau
Alteryx

Database & Big Data

Libraries

NumPy

Certificates

Graduate Certificate in Applied Machine Learning at UTD
Google Data Analytics Certificate
AWS Certified Cloud Practitioner Certificate
Alteryx Designer Core Certificate
Appian Certified Associate Developer
ACCELQ Automation Engineer Certificate

Languages

English
Chinese
Japanese

Education

The University of Texas at Dallas

Dallas, USA

Degree: Ph.D. in Political Science
GPA: 3.95/4.0

Degree: Master of Science in Social Data Analytics and Research
GPA: 3.95/4.0

Degree: Master of Science in Business Analytics
GPA: 4.0/4.0

Shandong University

Jinan, China

Degree: Master of Law in International Politics
GPA: 88.78/100

Degree: Bachelor of Arts in Japanese
GPA: 87.37/100

Contact
