Dataset Generators

The jasmine.datasets module provides synthetic data generators for testing and examples.

Functions

`generate_regression`([n_samples, n_features, ...])	Generate a random regression problem with JAX.
`generate_classification`([n_samples, ...])	Generate a random n-class classification problem.
`generate_polynomial`([n_samples, degree, ...])	Generate a polynomial regression problem with one feature.

generate_regression

jasmine.datasets.generate_regression(n_samples=100, n_features=20, n_informative=10, noise=0.0, bias=0.0, shuffle=True, coef=False, random_state=None)

Generate a random regression problem with JAX.

This function creates a dataset where the output is a linear combination of a subset of the input features, with optional Gaussian noise.

Parameters:

n_samples (int) – The number of samples to generate.
n_features (int) – The total number of features.
n_informative (int) – The number of features that are actually used to generate the output. The rest are noise.
noise (float) – The standard deviation of the Gaussian noise added to the output.
bias (float) – The bias term (intercept) in the underlying linear model.
shuffle (bool) – Whether to shuffle the features and informative indices. If False, the informative features will always be the first n_informative columns.
coef (bool) – If True, the ground truth coefficients and bias are returned.
random_state (int, optional) – Seed for the random number generator for reproducibility. If None, a random seed is used.

Returns:

By default, returns (X, y). If coef is True, returns (X, y, ground_truth_coefficients).

Return type:

tuple
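
The generative model described above (a linear combination of a subset of features, plus a bias and optional Gaussian noise) can be sketched in a few lines of JAX. This is an illustration only, not jasmine's implementation: the helper name sketch_linear_regression_data is hypothetical, and drawing the features and weights from a standard normal is an assumption the documentation does not confirm. The sketch uses the shuffle=False layout, where the informative features are the first n_informative columns.

```python
import jax
import jax.numpy as jnp


def sketch_linear_regression_data(key, n_samples=100, n_features=20,
                                  n_informative=10, noise=0.0, bias=0.0):
    """Hypothetical helper illustrating the model; not part of jasmine."""
    key_x, key_w, key_eps = jax.random.split(key, 3)

    # Feature matrix: standard normal entries (an assumption).
    X = jax.random.normal(key_x, (n_samples, n_features))

    # Only the first n_informative features get a non-zero weight;
    # the remaining columns do not influence y at all.
    w_informative = jax.random.normal(key_w, (n_informative,))
    coef = jnp.zeros(n_features).at[:n_informative].set(w_informative)

    # y = X @ coef + bias + noise * eps, with eps ~ N(0, 1).
    eps = jax.random.normal(key_eps, (n_samples,))
    y = X @ coef + bias + noise * eps
    return X, y, coef


X, y, coef = sketch_linear_regression_data(
    jax.random.PRNGKey(0), n_samples=200, n_features=8,
    n_informative=3, noise=0.1, bias=2.0,
)
print(X.shape, y.shape, coef.shape)
```

With noise=0.0 the targets are an exact linear function of the informative columns, which is convenient for unit tests.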

generate_classification

jasmine.datasets.generate_classification(n_samples: int = 100, n_features: int = 20, n_informative: int = 5, n_redundant: int = 2, n_classes: int = 2, class_sep: float = 1.0, feature_noise: float = 1.0, redundant_noise: float = 0.0, shuffle: bool = True, random_state: int | None = None) → Tuple[Array, Array]

Generate a random n-class classification problem.

This function creates clusters of points normally distributed around vertices of a hypercube, making it suitable for testing classification algorithms.

Parameters:

n_samples – The number of samples.
n_features – The total number of features.
n_informative – The number of informative features.
n_redundant – The number of redundant features (linear combinations of informative features).
n_classes – The number of classes (or labels).
class_sep – Factor multiplying the hypercube size. Larger values spread out the classes and make the problem easier.
shuffle – Whether to shuffle the features.
random_state – Seed for the random number generator.

Returns:

A tuple (X, y) where X is the feature matrix and y are the integer labels.
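
To make the "clusters around vertices of a hypercube" idea concrete, here is a rough sketch in JAX. It is not jasmine's implementation: the helper name sketch_hypercube_clusters is hypothetical, and the vertex assignment (the binary digits of the class index), the unit-variance cluster noise, and the random mixing matrix for the redundant features are all assumptions made for illustration.

```python
import jax
import jax.numpy as jnp


def sketch_hypercube_clusters(key, n_samples=100, n_informative=3,
                              n_redundant=2, n_classes=2, class_sep=1.0):
    """Hypothetical helper illustrating the idea; not part of jasmine."""
    if n_classes > 2 ** n_informative:
        raise ValueError("need at least one hypercube vertex per class")
    key_y, key_x, key_mix = jax.random.split(key, 3)

    # Integer class label for every sample.
    y = jax.random.randint(key_y, (n_samples,), 0, n_classes)

    # Vertex for class k: the bits of k, mapped from {0, 1}
    # to {-class_sep, +class_sep}.
    bits = (jnp.arange(n_classes)[:, None] >> jnp.arange(n_informative)[None, :]) & 1
    centers = class_sep * (2.0 * bits - 1.0)

    # Informative features: Gaussian cloud around each class vertex.
    informative = centers[y] + jax.random.normal(key_x, (n_samples, n_informative))

    # Redundant features: linear combinations of the informative ones.
    mixing = jax.random.normal(key_mix, (n_informative, n_redundant))
    redundant = informative @ mixing

    X = jnp.concatenate([informative, redundant], axis=1)
    return X, y


X, y = sketch_hypercube_clusters(
    jax.random.PRNGKey(0), n_samples=300, n_informative=3,
    n_redundant=2, n_classes=4, class_sep=2.0,
)
print(X.shape, jnp.bincount(y))
```

Increasing class_sep pushes the vertices apart relative to the fixed cluster spread, which is why larger values make the problem easier.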

generate_polynomial

jasmine.datasets.generate_polynomial(n_samples: int = 100, degree: int = 2, noise: float = 0.0, bias: float = 0.0, coef: bool = False, random_state: int | None = None)

Generate a polynomial regression problem with one feature.

Parameters:

n_samples – The number of samples.
degree – The degree of the polynomial relationship.
noise – The standard deviation of the Gaussian noise.
bias – The bias term (intercept).
coef – If True, the ground truth coefficients and bias are returned.
random_state – Seed for the random number generator.

Returns:

By default, returns (X, y). X will have a shape of (n_samples, 1). If coef is True, returns (X, y, ground_truth_coefficients, bias).
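
A minimal sketch of a single-feature polynomial model in JAX follows. It is not jasmine's implementation: the helper name sketch_polynomial_data is hypothetical, and sampling x uniformly from [-1, 1] and drawing the coefficients from a standard normal are assumptions. The return value mirrors the four-element tuple described above for coef=True.

```python
import jax
import jax.numpy as jnp


def sketch_polynomial_data(key, n_samples=100, degree=2, noise=0.0, bias=0.0):
    """Hypothetical helper illustrating the model; not part of jasmine."""
    key_x, key_c, key_eps = jax.random.split(key, 3)

    # One input feature, so X has shape (n_samples, 1).
    X = jax.random.uniform(key_x, (n_samples, 1), minval=-1.0, maxval=1.0)

    # One coefficient per power x^1 ... x^degree (the intercept is `bias`).
    coef = jax.random.normal(key_c, (degree,))

    # Columns x^1, x^2, ..., x^degree, shape (n_samples, degree).
    x = X[:, 0]
    powers = jnp.stack([x ** k for k in range(1, degree + 1)], axis=1)

    eps = jax.random.normal(key_eps, (n_samples,))
    y = powers @ coef + bias + noise * eps
    return X, y, coef, bias


X, y, coef, bias = sketch_polynomial_data(
    jax.random.PRNGKey(0), n_samples=50, degree=3, noise=0.0, bias=1.5,
)
print(X.shape, y.shape, coef.shape)
```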

Examples

Regression Data

from jasmine.datasets import generate_regression
import jax.numpy as jnp

# Basic regression dataset
X, y = generate_regression(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    noise=0.1,
    random_state=42
)

print(f"Data shape: {X.shape}")
print(f"Target range: [{jnp.min(y):.2f}, {jnp.max(y):.2f}]")

# With ground truth coefficients
X, y, coef = generate_regression(
    n_samples=500,
    n_features=10,
    coef=True,
    random_state=42
)

print(f"True coefficients: {coef}")
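
One common use of the ground-truth coefficients is checking that an estimator recovers them. The sketch below does this with ordinary least squares via jnp.linalg.lstsq. To stay self-contained it builds a small noise-free stand-in dataset inline instead of calling generate_regression; the variable names and the stand-in data are illustrative assumptions, not part of jasmine.

```python
import jax
import jax.numpy as jnp

# Stand-in for the output of a regression generator with coef=True
# (built inline here; the values are hypothetical).
key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_x, (200, 5))
true_coef = jax.random.normal(key_w, (5,))
true_bias = 0.5
y = X @ true_coef + true_bias  # noise-free targets

# Append a column of ones so the intercept is estimated too.
A = jnp.concatenate([X, jnp.ones((X.shape[0], 1))], axis=1)
solution, _, _, _ = jnp.linalg.lstsq(A, y)

est_coef, est_bias = solution[:-1], solution[-1]
print(f"Max coefficient error: {jnp.max(jnp.abs(est_coef - true_coef)):.2e}")
```

With noise above zero the estimates will only approximate the true values, and the gap shrinks as n_samples grows.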

Classification Data

from jasmine.datasets import generate_classification
import jax.numpy as jnp

# Binary classification
X, y = generate_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    class_sep=1.0,
    random_state=42
)

print(f"Data shape: {X.shape}")
print(f"Class distribution: {jnp.bincount(y)}")

# Multi-class classification
X, y = generate_classification(
    n_samples=1500,
    n_features=15,
    n_classes=3,
    n_informative=8,
    random_state=42
)

Polynomial Data

from jasmine.datasets import generate_polynomial

# Polynomial regression dataset
X, y = generate_polynomial(
    n_samples=500,
    degree=3,
    noise=0.1,
    random_state=42
)

print(f"Polynomial data shape: {X.shape}")

Usage Tips

Set random_state to get reproducible datasets.
Adjust the noise parameter to control how hard the problem is.
n_informative controls how many features actually affect the target.
class_sep controls how well separated the classes are in classification data.
All generators return JAX arrays, so the output can be passed directly to other JAX code.
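
Why a fixed seed gives identical data: JAX's random functions are pure functions of an explicit key, so the same key always produces the same draws. The sketch below shows this with jax.random directly. It assumes, without confirmation from the documentation, that the generators turn random_state into a key in a similar way; the helper name draw is hypothetical.

```python
import jax
import jax.numpy as jnp


def draw(seed, shape=(5, 3)):
    """Hypothetical helper: sample a matrix from an integer seed."""
    key = jax.random.PRNGKey(seed)
    return jax.random.normal(key, shape)


a = draw(42)
b = draw(42)
c = draw(7)

same_seed_matches = bool(jnp.array_equal(a, b))
different_seed_differs = not bool(jnp.array_equal(a, c))
print(same_seed_matches, different_seed_differs)
```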