Creating a Dataset
In Prem, you can build datasets in two ways:- Upload an existing dataset in
JSONLformat. - Generate synthetic datasets directly from different input sources such as files, YouTube videos, websites, or a mix of sources.
Uploading a Dataset
Generate a Dataset with Synthetic Data Generation
Synthetic data generation lets you create datasets from various input sources beyond JSONL. You can import documents, scrape websites, process YouTube videos, or combine multiple sources into one dataset.Step 1: Define Dataset and Sources
- Enter a descriptive dataset name.
- Choose one or more data sources:
- Files Only: PDF, DOCX, TXT, HTML, PPTX
- YouTube Videos: individual videos or playlists
- Web Scraping: one or more website URLs
- Mixed Sources: combine multiple input types
- Set the number of QA pairs to generate from each source.

Step 2: Set Optional Guidance (Optional)
Get control of the generation process by setting additional parameters:
- Rules & Constraints – add conditions for the generated content (e.g., enforce style, define tone, restrict scope).
- QA Guidance – provide example pairs or specify output formats.
- Creativity Level – adjust the model’s temperature to balance consistency vs. variety.

Options for Synthetic Data Generation
When generating synthetic datasets, you can configure:- Data Sources – files, YouTube videos, websites, or a mix.
- Synthetic Pairs Configuration – number of QA pairs per input.
- Rules & Constraints – optional rules to shape the outputs.
- QA Guidance – add examples or output specifications.
- Creativity Level – control the randomness of the generation.


