Here's what I understand from it:

- Personal information: basic details such as names, gender, date of birth, and contact information (phone, email).
- Educational background: academic performance from 10th grade through college, including percentages and GPAs, plus the college attended, its tier, and its location.
- Skills and aptitude: scores in subjects such as English, Logical reasoning, and Quantitative ability, along with scores in specific engineering fields (Computer Science, Mechanical Engineering, etc.).
- Personality traits: interestingly, the dataset includes measures of traits like conscientiousness, agreeableness, and extraversion.
- Professional information: salaries and company names, suggesting this is employment-related data.
- Additional details: passport numbers, credit card numbers, and addresses, which are unusual for a typical employment dataset and need careful handling for privacy reasons.
- Data quality: almost all columns have 1,803 non-null values, indicating very little missing data. This is great for analysis!
- Data types: a mix of numerical (int64 and float64) and categorical (object) columns, which will require different handling techniques during analysis.

This dataset is well-suited for analyzing the factors that might influence a person's salary or job prospects, but the sensitive personal information must be handled with care and in compliance with data protection regulations.
Our dataset is rich and diverse, containing a wide range of information about students and their backgrounds:

- Personal details: ID, name, gender, and date of birth.
- Academic history: 10th and 12th grade performance, college details, and graduation year.
- Skills and knowledge: scores in English, Logical reasoning, Quantitative ability, and various engineering fields.
- Personality traits: measures of characteristics like conscientiousness and agreeableness.
- Location information: where the students studied and lived.
- Financial information: salary and credit card details (the latter is sensitive and must be handled carefully).
- Contact information: phone numbers and email addresses.

This variety gives us a comprehensive view of each student. We can use it to understand how different factors, from academic performance to personality traits, might affect a student's career prospects and salary. We should be mindful of privacy concerns, especially with sensitive fields like credit card details, and remove or encrypt such data before analysis. Overall, this dataset could be very useful for predicting career outcomes and understanding which factors contribute most to a student's success after graduation.
- Student performance: Most students did well in 10th grade, with an average score of about 78%. In 12th grade the average dropped slightly to about 75%, and the average college GPA is around 72 out of 100.
- College information: 12th-grade completion years ('12graduation') range from 2001 to 2012, with most around 2008-2009. There are two tiers of colleges, with most students (about 93%) in Tier 2, and about 30% of students studied in major-city colleges.
- Graduation and employment: Most students graduated from college between 2012 and 2014. The average salary for graduates is about 85,715, with the middle 50% of salaries falling between 83,451 and 87,568.
- Personality traits: Students show a wide range of traits; on average they scored highest in agreeableness and lowest in neuroticism.
- Data spread: There's a good mix of students with different backgrounds and performance levels, and salaries don't vary much, suggesting consistent job-market conditions.
- Interesting points: Some students have very high GPAs (up to 99%), while others are very low (as low as 6.63%, which may be a data-entry issue). College IDs span a wide range (from 2 to 18409), but these are identifiers, not measures of college size.

This data gives us a good overview of student performance from high school through college and how it might relate to starting salaries. It could be useful for understanding what factors influence a graduate's initial salary in the job market.
The data quality for this dataset is excellent: almost every column is complete, with no missing values, so we don't need to spend time imputing gaps or removing incomplete records. The only exception is the 'last_name' column, which is missing a single value. This isn't a problem, especially since last names play no role in the main analysis, and the one gap can be handled without affecting our results. Complete data like this means we can trust our analysis and don't have to worry about missing information skewing the results. It lets us focus on the relationships between different factors and salary, which is the main goal of the project, and puts us in a strong position to get reliable insights without any major cleaning step.
The dataset has no duplicate rows. This is good news! It means each entry is unique, which helps ensure the accuracy of our analysis. We don't need to worry about removing duplicates or how they might affect our results. This clean data will make our predictions more reliable.
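For reference, here is a minimal sketch of the two data-quality checks described above, assuming the data has been loaded into a pandas DataFrame named `df` (the file name below is a placeholder, not the actual path):

```python
import pandas as pd

# Placeholder file name; substitute the actual dataset path.
df = pd.read_csv("graduates.csv")

# Missing values per column (expected: only 'last_name' has one missing entry).
missing = df.isnull().sum()
print(missing[missing > 0])

# Fully duplicated rows (expected: 0).
print("Duplicate rows:", df.duplicated().sum())
```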
Insight on outliers: Our dataset shows varying levels of outliers across different features. Key observations (a sketch of how such percentages can be computed follows below):

- Many features have few or no outliers, suggesting clean data in these areas.
- Academic-related features like '12graduation' (14%) and 'ComputerProgramming' (21.6%) have higher outlier percentages, indicating diverse student backgrounds or potential data-entry issues.
- Address-related features like 'postcode' (27.3%) and 'building_number' (14.6%) show significant outliers, which might reflect a wide geographic spread of students.
- Core features for predicting salary, such as 'collegeGPA', 'English', 'Logical', and 'Quant' scores, have very few outliers (mostly under 1%), suggesting reliable data for our main prediction task.
- The target variable 'Salary' has no outliers, which is good for our prediction model.

These findings suggest we should pay special attention to handling outliers in the high-percentage features, while our core predictive features appear robust. This information will guide the preprocessing steps for building an effective salary prediction model.
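The percentages above are consistent with the standard 1.5×IQR rule; a sketch of how such figures could be computed (reusing `df` from the loading step):

```python
import pandas as pd

def outlier_percentage(series: pd.Series) -> float:
    """Percent of values outside the 1.5*IQR whiskers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (series < lower) | (series > upper)
    return 100 * mask.mean()

# Apply to every numeric column and rank from most to fewest outliers.
numeric_cols = df.select_dtypes(include="number")
pct = numeric_cols.apply(outlier_percentage).sort_values(ascending=False)
print(pct.round(1))
```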
The scatter plot of 10th percentage vs. Salary shows a slight positive trend: as 10th percentage increases, there is a modest rise in salary, indicated by the upward-sloping regression line. However, the wide scatter of points shows that 10th percentage alone isn't a strong predictor of salary. Points are concentrated in the 70-90% range, implying that most high earners scored well in 10th grade, and the relatively wide band around the line (the confidence interval) further indicates variability in the relationship. While better 10th-grade performance generally corresponds to higher salaries, other factors clearly play significant roles in determining final compensation.
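Plots like this one can be reproduced with seaborn's regplot, which draws the scatter, the fitted regression line, and a confidence band in one call; a sketch, assuming the column names '10percentage' and 'Salary':

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed column names; adjust to match the dataset.
sns.regplot(data=df, x="10percentage", y="Salary",
            scatter_kws={"alpha": 0.3},   # fade points to show density
            line_kws={"color": "red"})    # highlight the fitted trend line
plt.title("10th Percentage vs. Salary")
plt.tight_layout()
plt.show()
```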
Insight for 12th Percentage vs. Salary Graph: The scatter plot shows a weak positive relationship between 12th percentage and salary. While there's a slight upward trend in the regression line, the wide spread of points suggests that 12th percentage alone isn't a strong predictor of future salary. Students with higher 12th percentages (70-90%) seem to have a broader range of salaries, including some higher-paying jobs. However, the considerable overlap in salary ranges across different percentages indicates that other factors likely play significant roles in determining salary outcomes. The graph highlights that while academic performance in 12th grade may contribute to career prospects, it's just one of many factors influencing eventual earning potential.
The scatter plot reveals a positive relationship between college GPA and salary. Most students have GPAs between 60 and 80, and as GPA increases average salaries rise from about 81,000 to 88,000. The clustering of points around the regression line suggests that GPA is a useful, though far from definitive, predictor of salary; other factors clearly contribute as well. This information could be valuable for students aiming to maximize their earning potential and for employers gauging salary expectations based on academic performance.
The visualization reveals a significant skew towards undergraduate engineering degrees (B.Tech and BE) in our dataset. This predominance of bachelor's degrees over postgraduate qualifications (MCA and M.Tech/ME) suggests that the job market for entry-level engineering positions is primarily targeting fresh graduates. This insight could be valuable for both educational institutions in curriculum planning and for companies in tailoring their recruitment strategies. However, it's crucial to consider if this distribution accurately represents the broader job market or if it's a result of sampling bias in our data collection process.
The bar plot reveals a clear dominance of Electronics and Communication Engineering graduates in our dataset, followed closely by Computer Science Engineering. Information Technology and Computer Engineering also show significant representation. This distribution reflects the current tech industry trends, with a high demand for electronics and computer-related specializations. The prevalence of these fields suggests that our salary prediction model should be particularly robust for these specializations, potentially offering more accurate insights for tech-focused roles. However, we should be cautious about potential bias in our model towards these overrepresented fields and consider strategies to ensure fair predictions across all specializations.
The bar plot with overlaid data points effectively visualizes the relationship between degree, specialization, and salary. It reveals that B.Tech offers the most diverse range of specializations, followed by M.Tech/ME and then MCA. Interestingly, despite the differences in degrees and specializations, the average salary across all categories hovers around ₹80,000 or slightly above. This visualization is particularly useful for:

- Identifying salary trends across different educational backgrounds
- Comparing the diversity of specializations within each degree
- Spotting outliers or unusual patterns in the salary distribution

The color-coded bars for specializations and transparent data points allow a comprehensive view of both average salaries and individual observations, providing a nuanced understanding of salary distributions within each category.
Insight: Skill distribution across degrees. The line graph visualizes the average skill scores across different degree programs, offering valuable insight into the strengths and weaknesses of graduates from various academic backgrounds (a plotting sketch follows below).

Key observations:

- Skill variability: There's significant variation in skill levels across different degrees, indicating that each program has its unique focus and strengths.
- Domain-specific excellence: B.Tech graduates show a notable peak in electronics skills, suggesting a strong emphasis on hardware-related courses in their curriculum, while MCA students demonstrate superior computer science skills, reflecting the specialized nature of their program.
- Consistent strengths: Across all degrees, logical reasoning, English, and quantitative skills maintain relatively high scores, indicating these are core competencies developed regardless of the specific program.
- Program-specific patterns: B.Tech programs produce well-rounded graduates with balanced scores across most skills; MCA programs focus more on software and computer science skills at the expense of hardware-related skills; M.Tech programs show a pattern similar to B.Tech with some variation, possibly due to specialization within the master's program.
- Skill gaps: Consistently low scores in domain-specific skills like electronics, electrical, and telecom across most degrees suggest a potential gap in the curriculum or a need for more specialized programs in these areas.

Business implications:

- Recruitment strategy: Companies can tailor their hiring to the skill strengths associated with different degrees.
- Training programs: Organizations can design targeted training to address skill gaps specific to each degree background.
- Academic partnerships: Universities could use this data to refine their curricula, ensuring graduates have a competitive skill set aligned with industry needs.

This analysis provides a foundation for data-driven decision-making in both academic program design and corporate recruitment strategies, potentially leading to better alignment between graduate skills and industry requirements.
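A chart like the one described can be produced by averaging each skill column per degree; a minimal sketch, where the skill and 'Degree' column names are assumptions to be adjusted against the actual dataset:

```python
import matplotlib.pyplot as plt

# Assumed skill column names; adjust to match the dataset.
skills = ["English", "Logical", "Quant", "ComputerProgramming",
          "ElectronicsAndSemicon", "ComputerScience"]

# Average each skill per degree, then draw one line per degree across skills.
avg_by_degree = df.groupby("Degree")[skills].mean()
avg_by_degree.T.plot(marker="o", figsize=(10, 5))
plt.ylabel("Average score")
plt.title("Average Skill Scores by Degree")
plt.tight_layout()
plt.show()
```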
Insights from the distribution graphs:

- Use: These histograms provide a visual representation of the distribution of personality traits across the dataset, offering crucial insight into the characteristics of the engineering graduates.
- How: KDE (Kernel Density Estimation) overlays let us see both the raw data distribution and a smoothed probability density, giving a clearer picture of the underlying distribution.
- What we learn: Most personality traits show a relatively normal distribution, indicating a balanced representation of these traits in the graduate population. Openness to Experience and Agreeableness show negative skewness, meaning most graduates score higher on these traits with fewer individuals at the low end; this could indicate a selection bias in engineering programs or reflect traits cultivated during engineering education.

Insights from the box plots:

- Use: Box plots help us visualize the spread of the data, identify potential outliers, and compare distributions across the personality traits.
- How: The box represents the interquartile range (IQR), with the median shown as a line inside it; whiskers extend to cover the rest of the distribution, and points beyond them are plotted as potential outliers.
- What we learn: Agreeableness and Openness to Experience show outliers on the lower end, confirming the histogram observations. These outliers could represent individuals with significantly different personality profiles; their presence suggests that while most engineering graduates score higher on these traits, there is still considerable variability in the population.

A sketch for producing both panels follows below.
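A sketch of how both panels could be generated; the trait column names are assumptions and may need their spelling adjusted to the dataset's actual headers:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed Big Five trait column names; adjust to the dataset's spelling.
traits = ["conscientiousness", "agreeableness", "extraversion",
          "neuroticism", "openness_to_experience"]

# Top row: histogram with KDE overlay; bottom row: box plot, per trait.
fig, axes = plt.subplots(2, len(traits), figsize=(18, 6))
for i, trait in enumerate(traits):
    sns.histplot(df[trait], kde=True, ax=axes[0, i])  # distribution + KDE
    sns.boxplot(y=df[trait], ax=axes[1, i])           # spread and outliers
    axes[0, i].set_title(trait)
plt.tight_layout()
plt.show()
```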
The distribution plots for the skills and aptitude features reveal some interesting patterns. Most features, like 'English', 'Logical', and 'Quant', show relatively normal distributions, indicating a balanced spread of these skills among graduates. However, several technical skills, such as 'Electronics and Semiconductor', 'Computer Science', 'Mechanical Engineering', 'Electrical Engineering', 'Telecom Engineering', and 'Civil Engineering', display highly right-skewed distributions: the density curve peaks at the low end and then trails along the x-axis, suggesting that only a subset of graduates took these domain-specific tests or scored well on them.
Insights: <fill_this>
Insights: <fill_this>
Insights: <fill_this>
Considerations: <fill_this>
Issues found: <fill_this>
The target variable 'Salary' is a continuous numerical variable stored as float64, so we are dealing with a regression problem aimed at predicting precise salary values. The dataset contains 1,803 salary entries, providing a substantial sample for analysis. The range of salaries (roughly 80,000 to 90,000) is relatively narrow, which could imply a specific job level or industry focus in our dataset; this concentration might limit the model's ability to generalize to wider salary ranges. In subsequent steps, we should visualize the salary distribution using a histogram or box plot to identify any skewness or outliers that could influence our model's performance.
The salary distribution shows:

- Range: ~80k to ~90k, indicating a relatively narrow spread.
- The mean (~85.7k) is close to the median (~86.3k), suggesting a fairly symmetric distribution.
- The standard deviation (~2.5k) is small relative to the mean, indicating low variability.
- 50% of salaries fall between ~83.5k and ~87.6k (the IQR).

This tight distribution implies consistent salary offerings in the dataset, potential challenges in predicting small salary differences, and the need for high model precision to capture subtle variations. This information guides feature selection, model choice, and performance metric selection, and sets realistic expectations for prediction accuracy. For visualization, a histogram or box plot would effectively illustrate this distribution, highlighting its symmetry and concentration.
The histogram shows the majority of salaries clustering between 80,000 and 90,000, with the peak frequency in the 86,000 to 88,000 range, the most common bracket. Since the mean (~85.7k) sits slightly below the median (~86.3k) and the peak lies toward the upper end of the range, the distribution is, if anything, mildly left-skewed: most graduates earn within a tight band, with a tail of somewhat lower salaries. The histogram with a KDE overlay effectively visualizes both the discrete salary brackets and the overall distribution trend. This plot is crucial for understanding the central tendency and spread of our target variable, which informs our modeling approach and helps identify unusual patterns in graduate salaries.
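A sketch of the histogram-with-KDE described above, assuming the target column is named 'Salary':

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the target variable with a KDE overlay.
sns.histplot(df["Salary"], kde=True, bins=20)
plt.title("Distribution of Salary")
plt.xlabel("Salary")
plt.tight_layout()
plt.show()
```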
The boxplot reveals a clear correlation between College Tier and salary distribution. Tier 1 colleges show a higher median salary and a wider range, indicating greater earning potential but also more variability. Tier 2 colleges display a lower median salary with a narrower range, suggesting more consistent but generally lower earnings. Notably, Tier 1 exhibits some low outliers, which could represent unique cases or potential data anomalies worth investigating. This visualization effectively illustrates the impact of college prestige on graduate salaries, providing valuable insights for both educational institutions and job seekers in understanding the relationship between college tier and earning potential.
The boxplot visualization reveals significant salary disparities across different engineering specializations. Notably, Chemical Engineering stands out with higher median salaries and a wider range, indicating greater earning potential. Conversely, specializations like Instrumentation and Control Engineering and Electrical and Power Engineering show lower median salaries and narrower ranges.
Considerations: <fill_this>
Issues found: <fill_this>
Insights: The dataset is heavily skewed towards engineering graduates, particularly B.Tech/B.E. While M.Tech graduates have a marginally higher median salary, B.Tech/B.E. graduates show more salary variability and potential for higher earnings. The predominance of engineering degrees suggests a tech-focused job market or dataset collection bias.
Insights: <fill_this>
Insights: <fill_this>
Considerations: <fill_this>
Issues found: <fill_this>
The countplot reveals that Information Technology, Computer Science & Engineering, Electronics & Communication, and Computer Engineering are the most common specializations in the dataset. This insight is crucial for understanding job-market demand and the potential competition in these fields.

Salary distribution by specialization: The boxplot illustrates salary variation across different specializations. Most specializations show above-average salaries, indicating a generally lucrative field. However, some specializations, like Instrumentation and Control Engineering and Electrical and Power Engineering, show instances of below-average salaries.
Insights: <fill_this>
Insights: <fill_this>
Considerations: <fill_this>
Issues found: <fill_this>
Rationale: <fill_this>
In this crucial data preparation step, we strategically removed several columns from our datasets, including personal identifiers like 'profile', 'first_name', and 'last_name' and sensitive information such as 'passport_number' and 'cc_number'. These columns were eliminated for multiple reasons (a sketch follows below):

- Irrelevance: Fields like 'street_name' or 'company_suffix' don't contribute meaningfully to predicting salary ranges.
- Privacy concerns: Removing personal data like 'email' and 'phone_number' ensures ethical handling of sensitive information.
- Potential bias: Columns like 'Gender' were removed to prevent inadvertent discrimination in our model.
- Redundancy: Some columns, such as 'full_name', are combinations of other fields and thus redundant.
- Non-predictive temporal data: 'GraduationYear' and 'DOB' were removed as they might introduce unwanted temporal biases.

By eliminating these non-predictive features, we've streamlined the dataset, potentially improving model performance and reducing computational overhead. This careful feature selection lets us focus on the attributes most relevant to salary prediction, enhancing both the efficiency and interpretability of the model.
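A minimal sketch of this step, assuming the splits are held in DataFrames named train_df, val_df, and test_df (hypothetical names):

```python
# Columns assumed present; names follow those cited in the rationale above.
drop_cols = ["profile", "first_name", "last_name", "full_name",
             "email", "phone_number", "passport_number", "cc_number",
             "street_name", "company_suffix", "Gender", "DOB", "GraduationYear"]

# Drop from every split; errors="ignore" tolerates columns already absent.
for split in (train_df, val_df, test_df):
    split.drop(columns=drop_cols, inplace=True, errors="ignore")
```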
Rationale: <fill_this>
Results: <fill_this>
Rationale: <fill_this>
Results: <fill_this>
I focused on encoding categorical variables to prepare the dataset for machine learning. By using pd.get_dummies(), I transformed categorical features into binary columns, allowing us to include them in numerical analyses. After dropping the original categorical columns, I concatenated the encoded features back into the DataFrame. This transformation ensures that all variables, including numerical scores and encoded categories, are in a suitable format for correlation analysis and model training. The resulting dataset is now ready for deeper exploratory data analysis and machine learning, enabling us to uncover valuable insights effectively.
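A sketch of the encoding step described above; the categorical column names here are assumptions to be matched against the actual dataset:

```python
import pandas as pd

# Assumed categorical columns; adjust to the actual dataset.
cat_cols = ["Degree", "Specialization", "CollegeState"]

# One-hot encode, drop the originals, and concatenate the binary columns back.
dummies = pd.get_dummies(df[cat_cols], drop_first=True)
df = pd.concat([df.drop(columns=cat_cols), dummies], axis=1)
```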
In analyzing the correlation matrix, I've identified key insights for our salary prediction model. The strongest positive correlations with salary are found in quantitative skills, 10th and 12th percentage scores, and English proficiency, indicating their importance as predictors. Notably, the 10th and 12th percentage scores are highly correlated with each other, so I recommend using only the 12th percentage to avoid multicollinearity. College GPA also correlates positively with salary, highlighting the significance of academic performance. The college tier indicator correlates negatively with salary; since Tier 1 is coded with the lower number, this is consistent with the earlier boxplot showing higher salaries for Tier 1 graduates. Computer engineering specialization has the highest positive correlation with salary, while some specializations, like mechanical engineering, show weaker or negative correlations. Overall, our model should prioritize quantitative skills, academic performance, and college GPA, along with specialization in computer-related fields. These insights will guide model development and provide valuable information for students and educational institutions.
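For reference, a sketch of how the correlations with the target could be ranked (using the encoded df from the previous step):

```python
# Correlations of every numeric feature with the target, strongest first.
corr_with_salary = (df.corr(numeric_only=True)["Salary"]
                      .drop("Salary")
                      .sort_values(ascending=False))
print(corr_with_salary.head(10))   # strongest positive correlations
print(corr_with_salary.tail(5))    # strongest negative correlations
```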
I chose to use Mutual Information for feature selection because it captures both linear and non-linear relationships between features and the target variable. This method complements our correlation analysis by potentially uncovering complex interactions that simple correlations might miss. It's particularly valuable in our salary prediction model, where factors influencing income may have intricate, non-linear relationships.
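A sketch of the Mutual Information computation using scikit-learn's mutual_info_regression; it assumes the encoded frame df has no missing values and that 'Salary' is the target column:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Numeric feature matrix and target (assumes no missing values remain).
X = df.drop(columns=["Salary"]).select_dtypes(include="number")
y = df["Salary"]

# Estimate mutual information between each feature and salary.
mi = mutual_info_regression(X, y, random_state=42)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores.head(10))
```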
The top 10 features based on Mutual Information scores are:

1. 12th percentage (0.080170)
2. 10th percentage (0.060776)
3. Quant (0.058222)
4. Logical (0.049693)
5. English (0.045965)
6. collegeGPA (0.041407)
7. 12graduation (0.041113)
8. ComputerProgramming (0.031300)
9. conscientiousness (0.030559)
10. neuroticism (0.027607)

These results highlight the importance of academic performance (12th and 10th percentages), aptitude in key areas (Quant, Logical, English), and specific skills like computer programming. The inclusion of personality traits (conscientiousness and neuroticism) suggests these factors may have non-linear relationships with salary that weren't as apparent in our correlation analysis.
D.3 Final Selection of Features
In this code, I made sure all three data sets (training, validation, and testing) share the same features. I built a list of all unique features across the three sets, then added any missing features to each set, filled with zeros, so every set has identical columns. This matters because machine learning models must see the same features at every stage, from training through testing; mismatched features can cause errors or silently wrong results. Aligning the sets gives the model a consistent foundation, makes it easy to move between training, validation, and testing, and helps the model work reliably with new data in real situations. A sketch of this alignment follows below.
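A minimal sketch of the alignment, assuming the feature matrices are named X_train, X_val, and X_test (hypothetical names):

```python
# Union of columns across the three splits.
all_cols = sorted(set(X_train.columns) | set(X_val.columns) | set(X_test.columns))

# Reindex each split to the shared column list, filling absent columns with 0.
X_train = X_train.reindex(columns=all_cols, fill_value=0)
X_val = X_val.reindex(columns=all_cols, fill_value=0)
X_test = X_test.reindex(columns=all_cols, fill_value=0)
```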
Rationale: <fill_this>
Results: <fill_this>
Rationale: <fill_this>
Results: <fill_this>
Rationale: <fill_this>
Results: <fill_this>
Rationale: <fill_this>
I established two baseline models for our salary prediction task: a central tendency model and a linear regression model. The central tendency model predicts the mean salary for every example, providing a simple benchmark that serves as a null hypothesis; it's easy to interpret and minimizes the sum of squared errors on the training set. The linear regression model was tested with and without an intercept, using MSE and R2 score as metrics on both the training and validation sets. On the training data, linear regression explains a meaningful portion of the salary variance, and including an intercept improved performance; the validation results, discussed below, tell a more cautionary story. These baselines set clear performance targets for more advanced models, providing a solid foundation for our predictive modeling process. Next steps involve evaluating more complex models against these baselines, focusing on improving R2 scores and reducing MSE in future iterations.
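A sketch of both baselines, using scikit-learn's DummyRegressor for the central tendency model and LinearRegression for the regression baselines; X_train, y_train, X_val, and y_val are assumed to exist from the earlier preparation steps:

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def report(name, model, X, y):
    """Print MSE and R2 for a fitted model on one split."""
    pred = model.predict(X)
    print(f"{name}: MSE={mean_squared_error(y, pred):.3e}, "
          f"R2={r2_score(y, pred):.3f}")

# Central tendency baseline: always predict the training-set mean salary.
mean_model = DummyRegressor(strategy="mean").fit(X_train, y_train)
report("mean/train", mean_model, X_train, y_train)
report("mean/val", mean_model, X_val, y_val)

# Linear regression baselines, with and without an intercept.
for fit_intercept in (True, False):
    lr = LinearRegression(fit_intercept=fit_intercept).fit(X_train, y_train)
    report(f"lr(intercept={fit_intercept})/train", lr, X_train, y_train)
    report(f"lr(intercept={fit_intercept})/val", lr, X_val, y_val)
```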
Based on the output, our linear regression baseline models show concerning results. The model with an intercept (fit_intercept=True) performs better on the training set, with a lower MSE (4.77e+06) and a positive R2 score (0.212), indicating it explains about 21% of the variance in the training data. However, its performance on the validation set is extremely poor, with an enormous MSE (1.75e+30) and a hugely negative R2 score (-7.87e+22). The model without an intercept (fit_intercept=False) performs poorly on both training and validation sets, with negative R2 scores indicating it does worse than simply predicting the mean. Validation errors this large point less to ordinary overfitting than to numerical instability, for example from multicollinearity among the one-hot encoded features or from validation rows containing feature combinations unseen in training; issues with data scaling or the train/validation split are also worth checking. These results indicate that our current approach is not yet suitable for predicting salaries, and we need to revisit our data preparation, feature engineering, and model selection to develop a more reliable predictive model.