Data science has become a popular career path for students with STEM backgrounds, so I thought I’d compile a list of topics and questions that come up very often in technical interviews for DS jobs and internships. Keep in mind that this is not an exhaustive list, but it’s a very good starting point, as you are expected to have a good grasp of all of these things.
- Supervised vs unsupervised learning
- Linear Regression: pros and cons, assumptions
e.g. Suppose we have two variables, X and Y, where Y = X + some normal white noise. We regress Y on X; what will the coefficients be? If we then regress X on Y, what would happen?
- Regularisation methods – L1 (Lasso), L2 (Ridge)
- Logistic Regression
- Confounding variables
- Variance and bias, the tradeoff
- Decision trees and random forest
- SVM and the kernel function
- K-means clustering
- Classification
e.g. for binary classification how to calculate accuracy
- Over and underfitting
- Feature/variable selection
- K-fold cross validation
- How to deal with missing data
- Dimension reduction (PCA, SVD)
- How to deal with outliers? Normalise, remove etc.
- Eigenvalues and eigenvectors (e.g. obtain them for a 3x3 matrix)
- Error metrics: RMSE, MSE, MAE
- R squared (definition, how to interpret)
- A/B testing
- Hypothesis testing: p-values, confidence intervals, true positive rate and false positive rate, ROC curve, power of a test, Type I and Type II errors, Z-test vs T-test and when to use which
e.g. Does an ROC curve change if you square the outputs used to generate it?
e.g. Flip a coin 10 times and obtain one head; what are the null hypothesis and the p-value?
e.g. Flipping coins, how to test if a coin is fair
- Stationarity in time series
- Confusion matrix: calculating accuracy, precision, recall and F1 score
- SQL
- Recommender systems: collaborative filtering vs content-based filtering
- Gradient descent basics
- Scenarios of when to use different methods/models
e.g. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case? Logistic Regression.
- Star schema
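
Two of the example questions above (the coin flipped 10 times with one head, and the confusion matrix metrics) are easy to check numerically. Below is a minimal sketch in plain Python, not a definitive implementation. The function names are my own, the confusion matrix counts are made up purely for illustration, and the p-value uses one common convention for an exact two-sided binomial test (sum the probabilities of all outcomes that are no more likely than the observed one) under the null hypothesis that the coin is fair.

```python
from math import comb


def binomial_two_sided_p(n: int, k: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value under H0: P(heads) = p.

    Convention assumed here: add up the probability of every outcome
    that is no more likely than the observed count k.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so exact ties are not lost to float rounding.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # 10 flips, 1 head, H0: the coin is fair.
    print(binomial_two_sided_p(10, 1))  # 22/1024 = 0.021484375

    # Hypothetical counts, chosen only to illustrate the formulas.
    print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
```

With one head in 10 flips the two-sided p-value is 22/1024 ≈ 0.0215, so at the 5% level you would reject the hypothesis that the coin is fair. For the made-up counts, accuracy is 0.7, precision 0.8, recall about 0.667 and F1 about 0.727.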