Dataset Generators

The jasmine.datasets module provides synthetic data generators for testing and examples.

Functions

generate_regression([n_samples, n_features, ...])

Generate a random regression problem with JAX.

generate_classification([n_samples, ...])

Generate a random n-class classification problem.

generate_polynomial([n_samples, degree, ...])

Generate a polynomial regression problem with one feature.

generate_regression

jasmine.datasets.generate_regression(n_samples=100, n_features=20, n_informative=10, noise=0.0, bias=0.0, shuffle=True, coef=False, random_state=None)

Generate a random regression problem with JAX.

This function creates a dataset where the output is a linear combination of a subset of the input features, with optional Gaussian noise.

Parameters:
  • n_samples (int) – The number of samples to generate.

  • n_features (int) – The total number of features.

  • n_informative (int) – The number of features that are actually used to generate the output. The rest are noise.

  • noise (float) – The standard deviation of the Gaussian noise added to the output.

  • bias (float) – The bias term (intercept) in the underlying linear model.

  • shuffle (bool) – Whether to shuffle the features and informative indices. If False, the informative features will always be the first n_informative columns.

  • coef (bool) – If True, the ground truth coefficients and bias are returned.

  • random_state (int, optional) – Seed for the random number generator for reproducibility. If None, a random seed is used.

Returns:

By default, returns (X, y).

If coef is True, returns (X, y, ground_truth_coefficients).

Return type:

tuple
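
The linear model described above can be sketched in plain JAX. This is a minimal illustration, not jasmine's actual implementation: the helper name sketch_regression, the standard-normal features and weights, and the key-splitting scheme are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp


def sketch_regression(n_samples=100, n_features=20, n_informative=10,
                      noise=0.0, bias=0.0, seed=0):
    """Hypothetical helper: y = X @ w + bias + noise * eps, with sparse w."""
    key_x, key_w, key_idx, key_eps = jax.random.split(jax.random.PRNGKey(seed), 4)

    # Assumption: features are drawn from a standard normal distribution.
    X = jax.random.normal(key_x, (n_samples, n_features))

    # Choose the informative columns; every other weight stays at zero.
    informative = jax.random.permutation(key_idx, n_features)[:n_informative]
    w = jnp.zeros(n_features).at[informative].set(
        jax.random.normal(key_w, (n_informative,))
    )

    # Gaussian noise scaled by `noise` (its standard deviation).
    eps = jax.random.normal(key_eps, (n_samples,))
    y = X @ w + bias + noise * eps
    return X, y, w


X, y, w = sketch_regression(n_samples=200, n_features=8, n_informative=3,
                            noise=0.0, bias=1.5)
```

With noise=0.0 the target is an exact linear function of the informative columns, which makes a convenient sanity check for a regression solver.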

generate_classification

jasmine.datasets.generate_classification(n_samples: int = 100, n_features: int = 20, n_informative: int = 5, n_redundant: int = 2, n_classes: int = 2, class_sep: float = 1.0, feature_noise: float = 1.0, redundant_noise: float = 0.0, shuffle: bool = True, random_state: int | None = None) -> Tuple[Array, Array]

Generate a random n-class classification problem.

This function creates clusters of points normally distributed around vertices of a hypercube, making it suitable for testing classification algorithms.

Parameters:
  • n_samples – The number of samples.

  • n_features – The total number of features.

  • n_informative – The number of informative features.

  • n_redundant – The number of redundant features (linear combinations of informative features).

  • n_classes – The number of classes (or labels).

  • class_sep – Factor multiplying the hypercube size. Larger values spread out the classes and make the problem easier.

  • shuffle – Whether to shuffle the features.

  • random_state – Seed for the random number generator.

Returns:

A tuple (X, y) where X is the feature matrix and y are the integer labels.
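
The idea of clusters centred on hypercube vertices can be sketched as follows. This is only an illustration under stated assumptions, not jasmine's implementation: the helper name sketch_classification is hypothetical, only the informative features are generated, each class gets one vertex taken from the bits of its index, and the cluster spread is fixed at unit variance.

```python
import jax
import jax.numpy as jnp


def sketch_classification(n_samples=100, n_informative=5, n_classes=2,
                          class_sep=1.0, seed=0):
    """Hypothetical helper: Gaussian clusters around hypercube vertices."""
    if n_classes > 2 ** n_informative:
        raise ValueError("not enough hypercube vertices for n_classes")

    key_y, key_x = jax.random.split(jax.random.PRNGKey(seed))

    # Vertex for class c: the bits of c, mapped from {0, 1}
    # to {-class_sep, +class_sep}.
    bits = (jnp.arange(n_classes)[:, None] >> jnp.arange(n_informative)) & 1
    centroids = class_sep * (2.0 * bits - 1.0)

    # Draw a label per sample, then add unit-variance noise to its centroid.
    y = jax.random.randint(key_y, (n_samples,), 0, n_classes)
    X = centroids[y] + jax.random.normal(key_x, (n_samples, n_informative))
    return X, y


X, y = sketch_classification(n_samples=300, n_informative=4, n_classes=3,
                             class_sep=2.0)
```

Because class_sep scales the distance between centroids while the noise stays fixed, larger values reduce the overlap between clusters.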

generate_polynomial

jasmine.datasets.generate_polynomial(n_samples: int = 100, degree: int = 2, noise: float = 0.0, bias: float = 0.0, coef: bool = False, random_state: int | None = None)

Generate a polynomial regression problem with one feature.

Parameters:
  • n_samples – The number of samples.

  • degree – The degree of the polynomial relationship.

  • noise – The standard deviation of the Gaussian noise.

  • bias – The bias term (intercept).

  • coef – If True, the ground truth coefficients and bias are returned.

  • random_state – Seed for the random number generator.

Returns:

By default, returns (X, y). X will have a shape of (n_samples, 1). If coef is True, returns (X, y, ground_truth_coefficients, bias).
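
A single-feature polynomial target of this kind can be sketched as below. The helper name sketch_polynomial, the uniform range for X, and the normally distributed coefficients are assumptions for illustration; the library may sample these differently.

```python
import jax
import jax.numpy as jnp


def sketch_polynomial(n_samples=100, degree=2, noise=0.0, bias=0.0, seed=0):
    """Hypothetical helper: y = sum_k c_k * x**k + bias + noise * eps."""
    key_x, key_c, key_eps = jax.random.split(jax.random.PRNGKey(seed), 3)

    # One feature, kept as a column so X has shape (n_samples, 1).
    X = jax.random.uniform(key_x, (n_samples, 1), minval=-1.0, maxval=1.0)
    coefs = jax.random.normal(key_c, (degree,))

    # Columns are x^1 ... x^degree; the intercept is carried by `bias`.
    powers = jnp.concatenate([X ** k for k in range(1, degree + 1)], axis=1)
    eps = jax.random.normal(key_eps, (n_samples,))
    y = powers @ coefs + bias + noise * eps
    return X, y, coefs, bias


X, y, coefs, bias = sketch_polynomial(n_samples=50, degree=3, noise=0.0, bias=0.5)
```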

Examples

Regression Data

from jasmine.datasets import generate_regression
import jax.numpy as jnp

# Basic regression dataset
X, y = generate_regression(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    noise=0.1,
    random_state=42
)

print(f"Data shape: {X.shape}")
print(f"Target range: [{jnp.min(y):.2f}, {jnp.max(y):.2f}]")

# With ground truth coefficients
X, y, coef = generate_regression(
    n_samples=500,
    n_features=10,
    coef=True,
    random_state=42
)

print(f"True coefficients: {coef}")

Classification Data

from jasmine.datasets import generate_classification
import jax.numpy as jnp

# Binary classification
X, y = generate_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    class_sep=1.0,
    random_state=42
)

print(f"Data shape: {X.shape}")
print(f"Class distribution: {jnp.bincount(y)}")

# Multi-class classification
X, y = generate_classification(
    n_samples=1500,
    n_features=15,
    n_classes=3,
    n_informative=8,
    random_state=42
)

Polynomial Data

from jasmine.datasets import generate_polynomial

# Polynomial regression dataset
X, y = generate_polynomial(
    n_samples=500,
    degree=3,
    noise=0.1,
    random_state=42
)

print(f"Polynomial data shape: {X.shape}")

Usage Tips

  • Use random_state for reproducible datasets

  • Adjust the noise parameter to control difficulty: larger values add more Gaussian noise to the target

  • n_informative controls how many features actually affect the target

  • class_sep in classification controls how well-separated the classes are

  • All generators return JAX arrays for seamless integration
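
The reproducibility tip rests on how JAX handles randomness: the same integer seed produces the same PRNG key and therefore the same draws. The snippet below shows this with plain JAX. How jasmine turns random_state into a key internally is not documented here, so the draw helper is purely illustrative.

```python
import jax
import jax.numpy as jnp


def draw(seed, shape=(4, 3)):
    """Hypothetical helper: sample a small matrix from an integer seed."""
    return jax.random.normal(jax.random.PRNGKey(seed), shape)


a = draw(42)
b = draw(42)
c = draw(7)

same = bool(jnp.array_equal(a, b))            # identical seed, identical data
different = not bool(jnp.array_equal(a, c))   # different seed, different data
```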