Quiz #3

Question 1

To define a function in Python, such as one that computes the sum of two input arguments x and y, we generally use the following code:

[Figure: Question01.png]
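
If the figure is unavailable, here is a minimal sketch of the standard `def`-based version it likely shows (the function name `f` is assumed to match the choices below):

```python
# Standard (named) function definition: sum of two input arguments
def f(x, y):
    return x + y
```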

But since adding two numbers is a simple operation, how can you define the function inline and with less typing?

(Hint: Use Python's anonymous (lambda) function. If you are not sure, try experimenting with the code provided in each choice.)

A. f = lambda (x, y): x + y
B. f = lambda x, y: return x + y
C. f = lambda x, y: x + y
D. f = lambda x, y: (x + y)

Answer C. `f = lambda x, y: x + y`
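
A quick check of the choices (a hypothetical snippet; running each line yourself is the fastest way to see the errors):

```python
f = lambda x, y: x + y   # choice C: a lambda body is an expression, so no `return` keyword
print(f(2, 3))           # 5

# Choice D also runs (the parentheses around x + y are just redundant),
# but C is the version with the least typing.
# Choice A raises a SyntaxError in Python 3: tuple parameter unpacking was removed.
# Choice B raises a SyntaxError: `return` is a statement, not an expression.
```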

Question 2

In Python, what is the correct function name for a constructor of a custom Python class object?

A. __constructor__()
B. _init_()
C. myObject() (assuming the object class is named as “myObject”)
D. __init__()

Answer D. `__init__()`
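
For reference, a minimal sketch of a constructor (the class name `myObject` is borrowed from choice C):

```python
class myObject:
    def __init__(self, value):
        # __init__ is called automatically when the object is constructed
        self.value = value

obj = myObject(42)   # implicitly invokes myObject.__init__(obj, 42)
print(obj.value)     # 42
```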

Question 3

We learned about the true hypothesis $f$ and the true data density function $f(x,y)$. Suppose we sampled $f(x,y)$ for images of cats and dogs, each sample denoted as $\left(x^{\left(i\right)},y^{\left(i\right)}\right)$ for $i = 1, \ldots, n$, where $n$ is the sample size, $x^{(i)}$ is an image, and $y^{(i)}$ is its label. Note that this is a binary classification problem with “cat” images labeled as “$y=0$” and “dog” images labeled as “$y=1$”.

Suppose we know $f(x,y)$, which produces the true probability of the pair $(x,y)$ belonging to $f$. A correct use of the Bayesian Optimal Classifier theorem is to classify the image $x$ as a “cat” image if

\[f(x, y = 0) < f(x, y = 1)\]

True or False?

A. True
B. False

Answer B. False. If the image $x$ belongs to the "cat" ($y=0$) category, then the probability $f(x, y=0)$ should be higher than $f(x, y=1)$.
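
For reference, the decision rule implied by the answer is to pick the label with the higher joint probability:

\[\hat{y} = \text{argmax}_{y \in \{0, 1\}} f(x, y)\]

so $x$ is classified as a "cat" image only when $f(x, y=0) > f(x, y=1)$.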

Question 4

In Machine Learning, we use the term __A__ to describe the error between the ground truth and our model prediction, contributed by one sample, and the term __B__ to describe the expected model error measured on all the samples.

What is A and B?

A. A: “Gradient”; B: “Cost”
B. A: “Loss”; B: “Cost”
C. A: “Gradient”; B: “Loss”
D. A: “Cost”; B: “Loss”

Answer B. A: "Loss"; B: "Cost"

Question 5

Using the notations from class, we define

  • $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ to be the dataset of $n$ samples we collected from the underlying distribution $f(x, y)$.
  • $h_\theta(x)$ to be a model that takes the input $x$ and outputs the prediction $\hat{y}$, parameterized by $\theta$.
  • $L(y, \hat y)$ to be the loss function that measures the error between the ground truth $y$ and the model prediction $\hat{y} = h_\theta(x)$.
  • $E_{x, y \sim f}(\cdot)$ to be the expectation operator applied to some function of $(x, y)$ over the distribution $f$.

Then, in Machine Learning, our goal is to find $\hat{\theta}$ and, therefore, the model $h_{\hat \theta}(x)$ such that

\[\hat \theta = \text{argmax}_{\theta \in \Theta}E_{x, y \sim f}[f(x, y)L(y, h_\theta(x))]\]

i.e., the parameter that corresponds to the highest expected loss over $f$.

True or False?

A. True
B. False

Answer B. False. We want to reduce the expected loss, which requires minimization rather than the maximization stated in the problem.
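
For reference, the corrected statement swaps the argmax for an argmin over the expected loss (the density $f(x, y)$ is already accounted for by the expectation $E_{x, y \sim f}$, so it does not appear inside the brackets):

\[\hat \theta = \text{argmin}_{\theta \in \Theta}E_{x, y \sim f}[L(y, h_\theta(x))]\]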

Question 6

Match the names of different types of Gradient Descent algorithms with their correct descriptions.

Batch:

A. In each iteration, the gradient for update is computed using just one (randomly picked) sample.
B. In each iteration, the gradient for update is computed using a subset of samples from the batch.
C. In each iteration, the gradient for update is computed using all samples.

Answer C. In each iteration, the gradient for update is computed using all samples.

Stochastic:

A. In each iteration, the gradient for update is computed using just one (randomly picked) sample.
B. In each iteration, the gradient for update is computed using a subset of samples from the batch.
C. In each iteration, the gradient for update is computed using all samples.

Answer A. In each iteration, the gradient for update is computed using just one (randomly picked) sample.

Mini-Batch:

A. In each iteration, the gradient for update is computed using just one (randomly picked) sample.
B. In each iteration, the gradient for update is computed using a subset of samples from the batch.
C. In each iteration, the gradient for update is computed using all samples.

Answer B. In each iteration, the gradient for update is computed using a subset of samples from the batch.
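
A minimal sketch contrasting the three variants (hypothetical code; it assumes a `grad(theta, samples)` helper that returns the average gradient over the given samples):

```python
import random

def batch_step(theta, data, grad, alpha):
    # Batch GD: one update using the gradient over ALL samples
    return theta - alpha * grad(theta, data)

def stochastic_step(theta, data, grad, alpha):
    # Stochastic GD: one update using a single randomly picked sample
    sample = random.choice(data)
    return theta - alpha * grad(theta, [sample])

def minibatch_step(theta, data, grad, alpha, m=32):
    # Mini-batch GD: one update using a random subset of m samples
    subset = random.sample(data, min(m, len(data)))
    return theta - alpha * grad(theta, subset)
```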

Question 7

Below is a plot of the cost $J$ as a function of the parameter $\theta$, which also shows the gradient descent (GD) steps on the parameter over iterations #1-7.

[Figure: Question07.png]

Based on the plot and the gradient descent algorithm, assuming $\alpha > 0$, which is correct about the gradient $d\theta$ computed at iteration #3?

A. $d\theta > 0$
B. $d\theta < 0$
C. $d\theta = 0$

Answer A. $d\theta > 0$. At iteration #3, the cost is lower in the direction of smaller $\theta$; therefore, the gradient must be positive.
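
For reference, the gradient descent update in the notation of the question is

\[\theta \leftarrow \theta - \alpha \, d\theta\]

so with $\alpha > 0$, a positive gradient $d\theta$ moves $\theta$ toward smaller values, which is exactly the direction of lower cost in the plot.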

Question 8

What are the preferred properties of an objective function?

Select all that are correct.

A. Adequate sensitivity to outliers
B. Convex
C. Computationally efficient
D. Interpretable
E. Aligned with the use case
F. Differentiable everywhere

Answer All of the above are correct (A, B, C, D, E, and F).