Common Machine Learning Notation Standards

Data and Dimensions

  • m - Number of training examples
  • n - Number of features/dimensions
  • X - Input matrix/feature matrix (usually ∈ ℝ^(m×n))
  • x - Single input vector or feature
  • x^(i) - The i-th training example
  • x_j - The j-th feature
  • x_j^(i) - The j-th feature of the i-th example
  • y - Output/target/label vector
  • y^(i) - Label for the i-th training example
  • ŷ (y-hat) - Predicted output
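
As a rough illustration of how the superscript and subscript indices map onto code, here is a minimal sketch using a made-up dataset. The values and the variable names x_i, x_ij, and y_i are assumptions for the example only. The math above is 1-indexed, while Python is 0-indexed.

```python
# Toy dataset with m = 3 training examples and n = 2 features (made-up values).
X = [
    [1.0, 2.0],   # x^(1)
    [3.0, 4.0],   # x^(2)
    [5.0, 6.0],   # x^(3)
]
y = [0, 1, 1]     # y^(i) is the label for x^(i)

m = len(X)        # number of training examples
n = len(X[0])     # number of features

# Pick example i = 2 and feature j = 1 in the 1-indexed math notation.
i, j = 2, 1
x_i = X[i - 1]            # x^(2): the 2nd training example
x_ij = X[i - 1][j - 1]    # x_1^(2): feature 1 of example 2
y_i = y[i - 1]            # y^(2): label of the 2nd example

print(m, n, x_i, x_ij, y_i)
```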

Model Parameters

  • θ (theta) - Parameter vector (common in classical ML)
  • w or W - Weight matrix or vector
  • b - Bias term
  • β (beta) - Regression coefficients (statistics convention)
  • α (alpha) - Learning rate
  • λ (lambda) - Regularization parameter
  • ε (epsilon) - Small value for numerical stability or error threshold
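
The w/b and θ conventions describe the same linear model in two ways. The sketch below assumes the common convention of folding the bias into θ by prepending a constant feature x_0 = 1; the numbers and the helper name predict are hypothetical.

```python
# Hypothetical values, chosen only for illustration.
w = [0.5, -1.0]   # weight vector
b = 2.0           # bias term
x = [4.0, 3.0]    # one input vector


def predict(w, b, x):
    """Linear prediction y_hat = w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Equivalent theta form (assumed convention): theta = [b, w_1, ..., w_n],
# with a constant x_0 = 1 prepended to the input.
theta = [b] + w
x_aug = [1.0] + x
y_hat_theta = sum(t * v for t, v in zip(theta, x_aug))

y_hat = predict(w, b, x)
print(y_hat, y_hat_theta)
```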

Functions and Operations

  • h(x) or h_θ(x) - Hypothesis function
  • f(x) - General function or true underlying function
  • g(x) - Often an activation function
  • J(θ) - Cost/objective function
  • L - Loss function (for single example)
  • ℒ or ℓ - Alternative loss notation
  • ∇ (nabla) - Gradient operator
  • σ (sigma) - Sigmoid function or standard deviation
  • Σ - Summation
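
To show how the hypothesis, cost, gradient, and sigmoid fit together, here is a minimal sketch. It assumes a linear hypothesis and a squared-error cost with a 1/(2m) scaling, which is one common convention and not the only one; the data and function names are made up.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h(theta, x):
    """Linear hypothesis h_theta(x) = theta . x."""
    return sum(t * v for t, v in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2 (assumed scaling)."""
    m = len(X)
    return sum((h(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)


def gradient(theta, X, y):
    """Gradient of J: component j is 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
    m = len(X)
    return [
        sum((h(theta, x_i) - y_i) * x_i[j] for x_i, y_i in zip(X, y)) / m
        for j in range(len(theta))
    ]


# Made-up data: the first column is the constant feature x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

theta = [0.0, 0.0]
J = cost(theta, X, y)
grad = gradient(theta, X, y)
print(J, grad, sigmoid(0.0))
```

With this data, θ = [0, 2] fits every example exactly, so the cost at that point is zero.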

Neural Network Specific

  • L - Number of layers
  • n^[l] - Number of units in layer l
  • W^[l] - Weight matrix for layer l
  • b^[l] - Bias vector for layer l
  • a^[l] - Activations for layer l
  • z^[l] - Pre-activation values for layer l
  • δ (delta) - Error term in backpropagation
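
The bracketed layer superscripts can be read as list indices in a forward pass. This sketch assumes the convention z^[l] = W^[l] a^[l-1] + b^[l] with a^[0] = x, a W^[l] of shape n^[l] × n^[l-1], and a sigmoid activation in every layer; the weights and the helper names matvec and forward are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Matrix-vector product W v, with W stored as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def forward(x, params):
    """params[l-1] holds (W^[l], b^[l]) for l = 1..L.

    Returns lists z and a indexed by layer, with a[0] = x and z[0] unused.
    """
    a = [x]
    z = [None]
    for W_l, b_l in params:
        z_l = [s + b_i for s, b_i in zip(matvec(W_l, a[-1]), b_l)]
        a_l = [sigmoid(v) for v in z_l]
        z.append(z_l)
        a.append(a_l)
    return z, a


# Made-up network: n^[0] = 2 inputs, n^[1] = 3 hidden units, n^[2] = 1 output.
params = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]], [0.0, 0.0, -1.0]),  # W^[1], b^[1]
    ([[1.0, 1.0, 1.0]], [0.0]),                                 # W^[2], b^[2]
]
L = len(params)   # number of layers (input layer not counted)

x = [1.0, 1.0]
z, a = forward(x, params)
print(z[1], a[L])
```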

Probability and Statistics

  • P(x) - Probability of x
  • p(x) - Probability density/mass function
  • 𝔼[X] - Expected value of X
  • μ (mu) - Mean
  • σ² - Variance
  • Σ - Covariance matrix (context-dependent)
  • 𝒩(μ, σ²) - Normal distribution
  • ~ - “Distributed as” (e.g., X ~ 𝒩(0,1))
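
As a quick check on the 𝒩(μ, σ²) convention, the sketch below draws samples X ~ 𝒩(0, 1) and estimates μ and σ² from them. Note that the notation takes the variance σ², while Python's random.gauss takes the standard deviation σ. The seed, sample size, and names mu_hat and sigma2_hat are arbitrary choices for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma2):
    """Density p(x) of N(mu, sigma2), parameterized by the variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


random.seed(0)  # fixed seed so the run is repeatable

# random.gauss expects the standard deviation, so pass sqrt(sigma2).
mu, sigma2 = 0.0, 1.0
samples = [random.gauss(mu, math.sqrt(sigma2)) for _ in range(10000)]

# Sample estimates of E[X] and Var[X].
mu_hat = sum(samples) / len(samples)
sigma2_hat = sum((s - mu_hat) ** 2 for s in samples) / len(samples)

print(mu_hat, sigma2_hat, normal_pdf(0.0, mu, sigma2))
```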

Optimization

  • t - Time step or iteration number
  • η (eta) - Learning rate (alternative to α)
  • ρ (rho) - Momentum coefficient or discount factor
  • γ (gamma) - Discount factor (reinforcement learning)
  • ∂ - Partial derivative
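
To make the roles of t, η, and ρ concrete, here is a minimal sketch of gradient descent with momentum on the toy objective J(θ) = θ². It assumes the update v_t = ρ v_(t-1) + ∂J/∂θ followed by θ_t = θ_(t-1) - η v_t, which is one of several momentum formulations in use; the starting point and hyperparameter values are made up.

```python
def grad_J(theta):
    """dJ/dtheta for the toy objective J(theta) = theta^2."""
    return 2.0 * theta


eta = 0.1     # learning rate
rho = 0.9     # momentum coefficient
theta = 5.0   # arbitrary starting point
v = 0.0       # velocity, initialized to zero

for t in range(1, 301):              # t indexes the iteration
    v = rho * v + grad_J(theta)      # accumulate a decaying sum of gradients
    theta = theta - eta * v          # step against the accumulated direction

print(theta)
```

The minimizer of this objective is θ = 0, so the printed value should be close to zero.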

Evaluation and Splits

  • X_train, X_test, X_val - Training, test, validation sets
  • k - Number of clusters (k-means) or folds (k-fold CV)
  • K - Number of classes in classification
  • TP, FP, TN, FN - True/False Positives/Negatives
  • ACC - Accuracy
  • MSE - Mean Squared Error
  • RMSE - Root Mean Squared Error
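
The metrics above follow directly from their definitions. A minimal sketch with made-up binary labels (for TP, FP, TN, FN, and ACC) and made-up regression targets (for MSE and RMSE):

```python
import math

# Made-up binary classification results (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

# Accuracy: fraction of predictions that are correct.
ACC = (TP + TN) / (TP + TN + FP + FN)

# Made-up regression targets and predictions.
targets = [3.0, 5.0, 2.0]
preds = [2.0, 5.0, 4.0]

MSE = sum((p - t) ** 2 for t, p in zip(targets, preds)) / len(targets)
RMSE = math.sqrt(MSE)

print(TP, TN, FP, FN, ACC, MSE, RMSE)
```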

Matrix Operations

  • X^T - Transpose of X
  • X^{-1} - Inverse of X
  • ⊙ - Element-wise multiplication (Hadamard product)
  • · or ⟨,⟩ - Dot product
  • ||x|| - Norm of x (often L2 unless specified)
  • I - Identity matrix
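
Several of these operations are easy to confuse, in particular the element-wise product (which returns a vector) and the dot product (which returns a scalar). A minimal sketch on small made-up vectors and matrices, using plain Python lists:

```python
import math

X = [[1.0, 2.0],
     [3.0, 4.0]]
u = [3.0, 4.0]
v = [1.0, 2.0]

# Transpose X^T: rows become columns.
X_T = [list(col) for col in zip(*X)]

# Element-wise (Hadamard) product: same shape as the inputs.
hadamard = [a * b for a, b in zip(u, v)]

# Dot product: a single scalar.
dot = sum(a * b for a, b in zip(u, v))

# L2 norm ||u|| = sqrt(u . u).
norm_u = math.sqrt(sum(a * a for a in u))

# 2 x 2 identity matrix I.
I = [[1.0 if r == c else 0.0 for c in range(2)] for r in range(2)]

print(X_T, hadamard, dot, norm_u, I)
```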

Special Notation

  • 1{condition} - Indicator function (1 if true, 0 if false)
  • argmax - Argument that maximizes
  • argmin - Argument that minimizes
  • log - Natural logarithm (unless a base is specified)
  • exp - Exponential function
  • 𝟙 - Vector of ones
  • 𝟎 - Vector of zeros
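
The indicator function and argmax often appear together, for example when turning class scores into a one-hot vector. A minimal sketch with made-up scores; the names indicator, k_star, and one_hot are hypothetical, and class indices here are 0-based.

```python
def indicator(condition):
    """1{condition}: 1 if the condition is true, 0 otherwise."""
    return 1 if condition else 0


scores = [0.1, 0.7, 0.2]   # made-up scores for K = 3 classes
K = len(scores)

# argmax / argmin return the index, not the value.
k_star = max(range(K), key=lambda k: scores[k])
k_min = min(range(K), key=lambda k: scores[k])

# One-hot encoding of the argmax class, built from the indicator function.
one_hot = [indicator(k == k_star) for k in range(K)]

ones = [1] * K    # vector of ones
zeros = [0] * K   # vector of zeros

print(k_star, k_min, one_hot)
```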

Indexing Conventions

  • i - Typically indexes training examples (1 to m)
  • j - Typically indexes features (1 to n)
  • k - Typically indexes output classes
  • l - Typically indexes layers in neural networks
  • t - Typically indexes time steps

Set Notation

  • 𝒟 - Dataset
  • ℝ - Real numbers
  • ℝ^n - n-dimensional real space
  • ∈ - “Element of”
  • ⊂ or ⊆ - “Subset of”