UpStage AI LAB 1st Competition - ML REGRESSION
Digital Treasure Quest Team’s Journey in the Apartment Price Prediction Competition
1. Competition Info
1.1. Overview
Competition Overview
Goal
Develop a model to predict apartment prices in Seoul based on real transaction data.
Introduction
The House Price Prediction competition aims to develop a model that predicts real transaction prices of apartments in Seoul using provided data. Apartment prices are influenced by various factors such as proximity to rivers, parks, and shopping centers. The predictive model considers these factors to accurately forecast market prices and assist in real estate transactions.
Provided Datasets
- Apartment transaction data: Includes location, size, year of construction, nearby facilities, and transportation convenience.
- Subway station information: Provided by Seoul City.
- Bus stop information: Provided by Seoul City.
- Evaluation data: For validating model performance.
Available Algorithms
Participants can use various regression algorithms such as linear regression, decision trees, random forests, and deep learning.
Modeling Goal
Develop an accurate and generalized model to forecast market trends and support real estate decisions. Participants gain practical experience by evaluating model performance and understanding the correlations between various features.
Submission Format
Submit results as a CSV file with 9,272 rows of predicted apartment transaction prices.
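A minimal sketch of writing such a file with pandas is shown below; the exact column layout follows the sample submission provided by the competition, so the column and file names here are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder prediction vector; in practice this comes from the trained model.
predictions = np.zeros(9272)

# Column and file names are assumptions based on a typical sample-submission layout.
submission = pd.DataFrame({"target": predictions})
assert len(submission) == 9272, "submission must contain exactly 9,272 rows"
submission.to_csv("submission.csv", index=False)
```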
1.2. Timeline
Project Duration
- July 9 (Tue) 10:00 ~ July 19 (Fri) 19:00
Key Milestones
- Competition Start: July 9 (Tue) 10:00
- Team Merge Deadline: July 10 (Wed) 10:00
- Development and Testing Period: July 9 (Tue) 10:00 ~ July 18 (Thu) 19:00
- Final Model Submission: July 19 (Fri) 19:00
Detailed Schedule
- July 9 (Tue): Dataset release and competition start
- Begin data exploration and preprocessing
- Form teams and allocate roles
- July 10 (Wed): Team merge deadline
- Start initial model development
- Complete data preprocessing
- July 11 (Thu): Model validation
- Train models and analyze initial results
- Discuss ways to improve model performance
- July 12 (Fri): Model improvement
- Try various algorithms
- Tune parameters
- July 13 (Sat): Model testing
- Evaluate model performance using validation data
- Analyze and correct errors
- July 14 (Sun): Feedback implementation
- Modify and improve model based on feedback
- Integrate additional data
- July 15 (Mon): Optimization
- Optimize the model and maximize performance
- Perform additional feature engineering
- July 16 (Tue): Review results
- Review and test the final model
- Begin analyzing and documenting results
- July 17 (Wed): Documentation
- Document model development process and results
- Final review and corrections
- July 18 (Thu): Final checks
- Final model inspection and submission preparation
- Final testing and results verification
- July 19 (Fri): Final model submission and competition end
- Submit the final model
- Prepare for results announcement
1.3. Evaluation
Evaluation Method
The competition is a regression task to predict the actual transaction prices of apartments. The models developed by participants are evaluated using RMSE (Root Mean Squared Error).
What is RMSE?
RMSE measures the typical size of the deviation between predicted and actual values. It is calculated by taking the square root of the mean of the squared differences between predicted and actual values.
Evaluation Criteria
- Prediction Accuracy: How closely the model's predicted prices track the actual transaction prices.
- Low RMSE Value: A lower RMSE indicates better predictive performance.
Calculation Method
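For $n$ transactions with actual price $y_i$ and predicted price $\hat{y}_i$, the standard RMSE formula is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$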
Context
In the context of apartment transactions, RMSE quantitatively reflects how closely the regression model’s predictions match actual transaction prices, playing a crucial role in assessing the model’s performance.
2. Winning Strategy
2.1. DTQ Team’s Pros and Cons
Pros
- Team members with diverse experience
- High average age
- High acceptance of AI assistants
Cons
- Low experience in team-based R&D using Git
- Low experience in Python-based R&D
- Low experience in machine learning/deep learning R&D
- Limited domain knowledge related to the competition topic
2.2. DTQ Team’s Strategic Approach
- Utilize AutoML tools like DataRobot to guide feature engineering and model selection.
- Each team member develops different machine learning models based on their expertise.
2.3. DTQ Team’s Culture & Spirit
- The goal of participating in the competition is to gain knowledge and experience in machine learning R&D through individual learning.
- Understand and respect each team member’s situation and contributions.
- Actively use AI assistants to maximize individual productivity.
- Avoid sacrificing individual schedules or resources for the team’s overall goal.
- Ensure that each team member makes at least one submission.
3. Components
3.1. Directory
- code: Source code and related documents for each team member’s experiments
- docs: Team documents (presentations, reference materials)
- images: Attached images
- README.md: Digital Treasure Quest team’s Apartment Price Prediction competition journey
4. Data Description
4.1. Dataset Overview
Training Data
- File name: train.csv
- Rows: 1,118,822
- Features: 52
- Numeric: 22
- Text: 5
- Categorical: 18
- Date: 7
- Size: 244 MB
Prediction Data
- File name: test.csv
- Rows: 9,272
- Features: 51
- Numeric: 22
- Text: 5
- Categorical: 17
- Date: 7
- Size: 2.46 MB
4.2. EDA
Feature Description
Detailed descriptions of each feature, along with checks for excess zeros, outliers, disguised missing values, inliers, and target leakage.
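A minimal sketch of the kind of pandas checks behind these descriptions; the target column name below is a placeholder, not the actual name in train.csv.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Share of missing values and of exact zeros per column
print(df.isna().mean().sort_values(ascending=False).head(10))
print((df.select_dtypes("number") == 0).mean().sort_values(ascending=False).head(10))

# Simple IQR rule for outliers in the target ("target" is a placeholder column name)
q1, q3 = df["target"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["target"] < q1 - 1.5 * iqr) | (df["target"] > q3 + 1.5 * iqr)
print(f"IQR outliers in target: {mask.sum()}")
```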
4.3. Feature Engineering
Methods Considered
- One-Hot Encoding
- Missing Values Imputed
- Smooth Ridit Transform
- Binning of numerical variables
- Matrix of char-grams occurrences using tfidf
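Of these, the one-hot encoding, missing-value imputation, and binning steps can be sketched with scikit-learn as below; the Smooth Ridit Transform and char-gram TF-IDF steps are DataRobot blueprint components and are not reproduced here, and all column names are placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Placeholder column lists; the real names come from train.csv.
categorical_cols = ["gu", "dong", "apartment_name"]
numeric_cols = ["area", "built_year", "floor"]

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("bin", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")),
    ]), numeric_cols),
])
```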
Feature Selection
Selected input features and target feature details.
5. Modeling
5.1. Model Selection
Model Validation Stability
To prevent overfitting, models are validated using k-fold cross-validation and a holdout sample.
Data Partition Methodology
- Main: Random sampling for the train/validation partition.
- Additional: Time-series-based sampling, adopted following UpStage AI Lab mentoring insights.
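The two partition approaches can be sketched as follows; the contract-date column name is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import KFold, TimeSeriesSplit, train_test_split

df = pd.read_csv("train.csv")

# Main approach: a random holdout sample plus k-fold cross-validation on the rest.
trainval, holdout = train_test_split(df, test_size=0.1, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Additional approach: time-ordered folds ("contract_date" is a placeholder column name).
df_sorted = df.sort_values("contract_date").reset_index(drop=True)
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(df_sorted):
    pass  # fit and score a model on each time-ordered fold here
```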
Selected Models
- eXtreme Gradient Boosted Trees Regressor
- Keras Slim Residual Network Regressor
- Light Gradient Boosted Trees Regressor
5.2. eXtreme Gradient Boosted Trees Regressor (DataRobot)
Modeling Descriptions
Detailed description of Gradient Boosting Machines and XGBoost.
Modeling Process
Hyperparameters
Details of hyperparameters and their best values.
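An illustrative fit of an XGBoost regressor is sketched below; the hyperparameter values are placeholders rather than the values DataRobot actually tuned, and toy data stands in for the preprocessed feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Toy data as a stand-in for the preprocessed training matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

# Placeholder hyperparameters, not DataRobot's tuned settings.
model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=7,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
```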
Feature Impact
Feature impact visualization.
Word Cloud
Word cloud visualization.
5.3. Keras Slim Residual Network Regressor (DataRobot)
Modeling Descriptions
Description of neural networks and the specific Keras Slim Residual Network model.
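A rough sketch of what a slim residual network for regression can look like in Keras; the layer widths and block count are illustrative, not DataRobot's actual blueprint.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_slim_residual_regressor(n_features: int) -> keras.Model:
    """Small MLP with one additive skip connection around a hidden block."""
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(inputs)

    # Residual block: two dense layers whose output is added back to the input.
    shortcut = x
    h = layers.Dense(64, activation="relu")(x)
    h = layers.Dense(64)(h)
    x = layers.Activation("relu")(layers.Add()([shortcut, h]))

    outputs = layers.Dense(1)(x)  # single regression output (price)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError()])
    return model

model = build_slim_residual_regressor(n_features=51)  # placeholder feature count
```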
Modeling Process
Neural Network
Neural network structure visualization.
Hyperparameters
Details of hyperparameters and their best values.
Training
Training process visualization.
Feature Impact
Feature impact visualization.
Word Cloud
Word cloud visualization.
5.4. Light Gradient Boosted Trees Regressor (DataRobot)
Modeling Descriptions
Description of LightGBM and its benefits.
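A hedged sketch of a LightGBM regressor fit is shown below; the hyperparameters are placeholders, and the toy frame illustrates one of LightGBM's practical benefits, native handling of pandas categorical columns.

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

# Toy frame standing in for the training data; "gu" is a placeholder categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "area": rng.uniform(20, 200, size=1000),
    "built_year": rng.integers(1980, 2023, size=1000),
    "gu": pd.Categorical(rng.choice(["A", "B", "C"], size=1000)),
})
y = rng.normal(size=1000)

# Placeholder hyperparameters; LightGBM consumes the categorical column without one-hot encoding.
model = LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=63)
model.fit(df, y, categorical_feature=["gu"])
```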
Modeling Process
Hyperparameters
Details of hyperparameters and their best values.
Feature Impact
Feature impact visualization.
Word Cloud
Word cloud visualization.
5.5. PyTorch Residual Network Regressor (By 박석)
Modeling Descriptions
Description of using PyTorch for Residual Network modeling.
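A minimal sketch of a residual MLP regressor in PyTorch, assuming tabular inputs; the widths, depth, and feature count are illustrative, not the settings used in the actual trials.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Two linear layers with a skip connection, as used in a residual MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.net(x))

class ResidualRegressor(nn.Module):
    def __init__(self, n_features: int, hidden: int = 128, n_blocks: int = 3):
        super().__init__()
        self.input = nn.Linear(n_features, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(torch.relu(self.input(x)))).squeeze(-1)

model = ResidualRegressor(n_features=51)  # placeholder feature count
```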
Modeling Process
Hyperparameters
Details of hyperparameters and their best values.
Training
Training process visualization.
Trial History
Details of each trial with changes and results.
Additional Trial Shared with UpStage AI Stages Community
Links to shared insights and trials.
5.6. XGBoost (By 백경탁)
Daily Log
Daily logs documenting key activities and insights.
2024-07-16
Selection of important features.
2024-07-17
Imputation of missing values and data handling.
2024-07-19
Log transformation of target values and time-series splitting.
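A small sketch of the target log transformation (fit on log1p of the price, invert with expm1 at prediction time); the target column name is a placeholder and only numeric columns are used to keep the sketch short.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# "target" is a placeholder column name; restricting to numeric columns keeps the sketch minimal.
train = pd.read_csv("train.csv")
X = train.drop(columns=["target"]).select_dtypes("number")
y_log = np.log1p(train["target"])            # fit on log-scale prices

model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y_log)

test = pd.read_csv("test.csv")[X.columns]
preds = np.expm1(model.predict(test))        # invert back to the original price scale
```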
Trial History
Details of each trial with changes and results.
Mentoring List
Key mentoring insights and actions taken.
5.7. Light GBM (By 한아름)
Trial History
Baseline code comparisons and results of the Light GBM model with Time Series K-Fold data splitting.
Mentoring List
Key mentoring insights and actions taken.
5.8. Baseline Code Enhancement (By 이승현)
TBD
6. Result
Leader Board
Final - Rank 3
Leaderboard final position and insights.
Submit History
Submission history with insights and learning points.
Presentation
Presentation details.
Personal Retrospective
Personal Retrospective Writing Guidelines
- The personal retrospective summarizes the technical challenges attempted during the competition, the lessons learned along the way, and the limitations and difficulties encountered, so that each learner can reflect on what they tried, how they tried it, and what they gained.
- In practice, it is important not only to do the work well but also to communicate logically and to keep growing on the basis of continuous records. We hope this habit helps you become effective practitioners through learning AI technology!
- Example:
- How did I achieve my learning goals?
- What were my team’s and my learning goals?
- Personal learning aspects
- Collaborative learning aspects
- How did I improve the model?
- Knowledge and techniques used
- What did I achieve and realize as a result of my actions?
- What new changes did I attempt compared to before, and what were the effects?
- What were the limitations and regrets?
- What will I try newly in the next competition based on the limitations/lessons learned?
Retrospective
During the competition, my primary goal was to apply and enhance my skills in data analysis and machine learning, with a focus on predicting apartment prices accurately. Our team aimed to leverage diverse experiences and the strengths of various algorithms to achieve this.
What I Did to Achieve My Learning Goals
- Team Learning Goals: Our team’s main goal was to build a robust predictive model using advanced machine learning techniques.
- Personal Learning Goals: My personal objective was to deepen my understanding of feature engineering and model optimization.
How I Improved the Model
- Utilized One-Hot Encoding and Smooth Ridit Transform for categorical variables.
- Applied rigorous feature selection to enhance model performance.
- Experimented with various algorithms, including eXtreme Gradient Boosted Trees and Light Gradient Boosted Trees.
Achievements and Realizations
- Successfully reduced RMSE through iterative model tuning.
- Gained insights into the impact of different features on model predictions.
- Improved my ability to preprocess and handle large datasets efficiently.
New Changes and Their Effects
- Implemented Time Series K-Fold data splitting, which improved the model's generalization capability.
- Introduced additional feature engineering techniques that enhanced prediction accuracy.
Limitations and Regrets
- Faced challenges in balancing model complexity and interpretability.
- Limited time to explore all potential feature interactions.
Plans for Future Competitions
- In the next competition, I plan to focus more on automated feature selection techniques.
- I will also explore more advanced neural network architectures to capture complex patterns in the data.
Overall, this competition was a valuable learning experience, providing practical insights into machine learning model development and the importance of teamwork in data science projects.