Questions & topics covered in technical interviews for data science internships/jobs PART I

TechGuru · September 11, 2021, 5:24pm

Data science has become a popular career path for students with STEM backgrounds, so I thought I’d compile a list of topics and questions that come up often in technical interviews for DS jobs or internships. Keep in mind this is not an exhaustive list, but it’s a good starting point, as you are expected to have a working understanding of all of these things.

Supervised vs unsupervised learning
Linear Regression: pros and cons, assumptions
e.g. Suppose we have two variables, X and Y, where Y = X + some normal white noise. If we regress Y on X, what will the coefficients be? If we then regress X on Y, what would happen?
Regularisation methods: L1 (Lasso) and L2 (Ridge)
Logistic Regression
Confounding variables
Variance and bias, the tradeoff
Decision trees and random forest
SVM and the kernel function
K-means clustering
Classification
e.g. For binary classification, how do you calculate accuracy?
Over and underfitting
Feature/variable selection
K-fold cross validation
How to deal with missing data
Dimension reduction (PCA, SVD)
How to deal with outliers? Normalise, remove etc.
Eigenvalues and eigenvectors (e.g. obtain them for a 3x3 matrix)
Error metrics: RMSE, MSE, MAE
R squared (definition, how to interpret)
A/B testing
Hypothesis testing: p-values, confidence intervals, true positive rate and false positive rate, ROC curve, power of a test, Type I and Type II errors, Z-test vs t-test and when to use which
e.g. Does an ROC curve change if you square the outputs used to generate it?
e.g. You flip a coin 10 times and obtain one head. What is the null hypothesis, and what is the p-value?
e.g. Flipping coins: how would you test whether a coin is fair?
Stationarity in time series
Confusion matrix: calculating accuracy, precision, recall and F1 score
SQL
Recommender systems: collaborative filtering vs content-based filtering
Gradient descent basics
Scenarios of when to use different methods/models
e.g. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case? Logistic Regression.
Star schema
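
For the regression question in the list above (Y = X + noise, regress Y on X, then X on Y), a quick simulation can help build intuition. This is only a sketch under assumptions that the question does not state: X is standard normal, the noise is standard normal and independent of X, and `ols_slope_intercept` is a helper name made up for this example.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of ys on a single predictor xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Assumption: X ~ N(0, 1) and noise ~ N(0, 1), independent of X.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

slope_yx, intercept_yx = ols_slope_intercept(x, y)  # regress Y on X
slope_xy, intercept_xy = ols_slope_intercept(y, x)  # regress X on Y

print(f"Y on X: slope={slope_yx:.3f}, intercept={intercept_yx:.3f}")
print(f"X on Y: slope={slope_xy:.3f}, intercept={intercept_xy:.3f}")
```

Under these assumptions the Y-on-X slope should come out close to 1 with an intercept near 0, while the X-on-Y slope should be close to Var(X) / (Var(X) + Var(noise)) = 0.5 rather than 1. The reverse slope shrinks because the noise inflates the variance of Y but not its covariance with X.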

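For the coin-flip question in the list (10 flips, one head), here is one possible way to work out the p-value. It assumes the null hypothesis is "the coin is fair" (p = 0.5) and uses one common convention for an exact two-sided binomial test: add up the probabilities of every outcome that is no more likely than the observed one. The function names are made up for this sketch.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided binomial p-value under H0: P(heads) = p_null."""
    observed = binom_pmf(heads, flips, p_null)
    probs = [binom_pmf(k, flips, p_null) for k in range(flips + 1)]
    # Small relative tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


def one_sided_p_value(heads, flips, p_null=0.5):
    """P(at most `heads` heads) under H0, for a 'too few heads' alternative."""
    return sum(binom_pmf(k, flips, p_null) for k in range(heads + 1))


p_two = two_sided_p_value(1, 10)
p_one = one_sided_p_value(1, 10)
print(p_two)  # 22/1024, about 0.0215
print(p_one)  # 11/1024, about 0.0107
```

At a 5% significance level both versions would reject the fair-coin hypothesis, but it is worth saying in an interview which alternative you are testing, since the two-sided value is twice the one-sided one here.
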
mugglehead · September 13, 2021, 2:09am

Are data science roles feasible for someone who has never taken stats?

TechGuru · September 13, 2021, 2:42pm

Yes, definitely, but you may have to spend a little more time preparing.