Introducing the "ArXiv BEIR" Dataset: Advancing RAG Systems

We're excited to introduce the "ArXiv BEIR" dataset, inspired by BEIR benchmarks, which focuses on ArXiv abstracts. This dataset is tailored to support the development of Retrieval-Augmented Generation (RAG) systems in a straightforward and practical manner.

The ArXiv BEIR Dataset

Our dataset comprises corpus/query pairs derived from ArXiv abstracts across various mathematical categories. It includes a corpus, a set of queries, and a qrels file (relevance judgments), organized to support the development and evaluation of RAG systems.

Structured Data for Development

The dataset is structured as follows:

- **Corpus File**: A .jsonl file with document information, including a unique document identifier, an optional document title, and the document text.

- **Queries File**: Another .jsonl file containing unique query identifiers and corresponding query texts.

- **Qrels File**: A .tsv file with three columns, representing the query-id, corpus-id, and relevance score, for evaluating RAG systems.
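To make the layout above concrete, here is a minimal loading sketch using only the Python standard library. The field names `_id`, `title`, and `text`, and the presence of a header row in the qrels file, are assumptions based on the usual BEIR convention rather than details confirmed here, and the sample records are invented for illustration. Adjust them to match the actual files.

```python
import csv
import io
import json


def load_jsonl(stream, id_field="_id"):
    """Read a .jsonl stream into a dict keyed by each record's identifier.

    The "_id" field name is an assumption (BEIR convention).
    """
    records = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        records[record[id_field]] = record
    return records


def load_qrels(stream, has_header=True):
    """Read a three-column .tsv stream into {query_id: {corpus_id: score}}."""
    reader = csv.reader(stream, delimiter="\t")
    if has_header:
        next(reader, None)  # assumed header: query-id, corpus-id, score
    qrels = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows
        query_id, corpus_id, score = row
        qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels


# Invented in-memory samples standing in for the real files.
corpus_file = io.StringIO(
    '{"_id": "d1", "title": "Example title", "text": "Example abstract."}\n'
    '{"_id": "d2", "text": "An abstract with no title."}\n'
)
queries_file = io.StringIO('{"_id": "q1", "text": "Example query?"}\n')
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\td1\t1\n")

corpus = load_jsonl(corpus_file)
queries = load_jsonl(queries_file)
qrels = load_qrels(qrels_file)
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`. Because the title is optional, read it with `corpus[doc_id].get("title", "")` rather than indexing directly.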

Supporting RAG System Development

The "ArXiv BEIR" dataset is a resource for researchers and developers working on RAG systems. Its documents and queries span a range of mathematical topics, so a model must retrieve the relevant abstracts before it can generate a meaningful response. The dataset can be used both to fine-tune RAG models and to evaluate them, with the aim of improving their performance in practical applications.

Key Data Fields

For consistency and clarity, we've defined key data fields:

Corpus
- corpus: A dictionary with document information, including a unique document id, document title, and document text.

Queries
- queries: A dictionary with query information, including a unique query id and query text.

Qrels
- qrels: A dictionary with query-document relevance judgments, including a query id, document id, and relevance score.
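As an illustration of how these dictionaries might be used for evaluation, the sketch below computes recall@k for the retrieval step. Recall@k is a standard retrieval metric chosen here for illustration and is not prescribed by the dataset. The `results` dictionary (query id to document id to retriever score) and all sample ids and scores are hypothetical.

```python
def recall_at_k(qrels, results, k):
    """Mean fraction of each query's relevant documents found in its top-k results.

    qrels:   {query_id: {doc_id: relevance_score}}
    results: {query_id: {doc_id: retriever_score}} (hypothetical retriever output)
    """
    per_query = []
    for query_id, judgments in qrels.items():
        relevant = {doc_id for doc_id, score in judgments.items() if score > 0}
        if not relevant:
            continue  # queries with no relevant documents are skipped
        ranked = sorted(
            results.get(query_id, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Invented judgments and retriever scores for two queries.
sample_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
sample_results = {
    "q1": {"d1": 0.9, "d4": 0.8, "d2": 0.1},
    "q2": {"d5": 0.7, "d3": 0.6},
}

recall_1 = recall_at_k(sample_qrels, sample_results, k=1)
recall_2 = recall_at_k(sample_qrels, sample_results, k=2)
```

This only measures the retrieval half of a RAG pipeline. Judging the generated answers would need a separate metric.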

Advancing RAG Systems

As the field of AI continues to evolve, RAG systems hold promise for enhancing natural language understanding and generation. The "ArXiv BEIR" dataset aims to contribute to this progress by providing a practical platform for testing and improving RAG models. Researchers and developers can use this resource to fine-tune models, conduct evaluations, and drive innovations in AI-driven information retrieval and question-answering.

Stay tuned for more resources as we work together to advance RAG system development.

Join us in advancing AI-driven information retrieval and question-answering with the "ArXiv BEIR" dataset.