Question 1

You are working on a predictive model and you notice that your categorical variable "Color" has 50 different levels, which is adding complexity to your model. You decide to group these levels into fewer categories based on their frequency of occurrence to improve the model’s interpretability and performance. What technique would you most likely use for this task?

Options :

A :

Principal Component Analysis (PCA)

B :

Decision Tree Binning

C :

Top-N Binning

D :

One-hot Encoding

Answer: C
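
As a rough illustration of Top-N Binning, the sketch below keeps the N most frequent levels and folds every remaining level into one catch-all level. The function name `top_n_bin`, the label "Other", and the sample data are hypothetical and chosen only for this example; the question does not say which tool is used.

```python
from collections import Counter

def top_n_bin(values, n, other_label="Other"):
    """Keep the n most frequent levels; map every other level to other_label."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(n)}
    return [v if v in keep else other_label for v in values]

# Hypothetical "Color" column: three common levels and three rare ones.
colors = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3 + ["teal", "mauve", "ochre"]
binned = top_n_bin(colors, n=3)

# The three rare colors are now a single "Other" level.
print(Counter(binned))
```

With 50 levels, the same idea applies with a larger N; the choice of N is a judgment call that trades detail against model complexity.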

Question 2

You are developing a predictive model using SAS and have completed the model training phase. You are now preparing to deploy the model for scoring a new dataset. Which mode should you operate in to score the new dataset and why?

Options :

A :

Training mode, because the model needs to be retrained on the new dataset before it can be scored.

B :

Training mode with cross-validation, to ensure that the model generalizes well to the new dataset.

C :

Output mode, as the model is already trained and the goal now is to use the model to predict new outcomes.

D :

Output mode, but with added regularization parameters to avoid overfitting on the new dataset.

Answer: C
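
The distinction between training and scoring can be sketched generically: parameters are estimated once on the training data and then applied unchanged to new rows. The example below is not SAS code and does not reproduce any SAS mode; `fit_line`, `score`, and the data are hypothetical, and a simple least-squares line stands in for the trained model.

```python
def fit_line(xs, ys):
    """Training step: estimate slope and intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def score(model, new_xs):
    """Scoring step: apply the frozen parameters; nothing is re-estimated."""
    slope, intercept = model
    return [intercept + slope * x for x in new_xs]

# Train once on historical data (here the points lie on y = 2x + 1).
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Score a new dataset with the already trained model.
predictions = score(model, [5, 6])
print(predictions)
```

Because scoring only applies stored parameters, retraining, cross-validation, and regularization belong to the training phase rather than to this step.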

Question 3

You are conducting a time series analysis and need to estimate the parameters of an ARIMA (Autoregressive Integrated Moving Average) model to forecast future sales. Given that the data show signs of non-stationarity and seasonality, which parameter estimation method should you use for the best results?

Options :

A :

Ordinary Least Squares (OLS)

B :

Maximum Likelihood Estimation (MLE)

C :

Generalized Least Squares (GLS)

D :

Instrumental Variables Estimation (IVE)

Answer: B

Question 4

A data analyst is using the RANDOMFOREST procedure to create a predictive model. They want to specify the number of trees to be generated in the random forest for a robust prediction. Which option in the RANDOMFOREST statement correctly sets the desired number of trees?

Options :

A :

NEST=100

B :

NTREE=100

C :

TREES=100

D :

NTREES=100

Answer: B

Question 5

When preparing data for a predictive modeling project, a data scientist notices that the outcome variable (purchase amount) varies substantially across the four categories of the categorical variable 'payment_type' ('credit card', 'debit card', 'paypal', 'other'). To improve the model's predictive accuracy, which strategy should the data scientist use to handle the 'payment_type' variable?

Options :

A :

Group all categories into a single 'payment_type' indicator variable.

B :

Perform one-hot encoding on 'payment_type' to create a binary indicator for each category.

C :

Collapse 'paypal' and 'other' into a single category if they have similar effects on the purchase amount.

D :

Treat 'payment_type' as a continuous variable to simplify the model.

Answer: B
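
A minimal sketch of one-hot encoding for 'payment_type' follows, using only the Python standard library. The function name `one_hot` and the `payment_type_` column prefix are assumptions made for illustration, not part of the question.

```python
def one_hot(values, categories):
    """Return one dict of 0/1 indicators per input value, one key per category."""
    return [
        {f"payment_type_{c}": int(v == c) for c in categories}
        for v in values
    ]

categories = ["credit card", "debit card", "paypal", "other"]
rows = one_hot(["paypal", "credit card", "other"], categories)

# Each row has four indicators, exactly one of which is set to 1.
for row in rows:
    print(row)
```

In a linear model that includes an intercept, one of the four indicators is usually dropped as the reference level to avoid perfect collinearity.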