Introducing CCI: A High-Quality Chinese Internet Language Dataset For AI

Hello r/datasets Community,

We’re excited to introduce the Chinese Corpora Internet (CCI) dataset v1.0.0, a high-quality Chinese internet language dataset, meticulously developed by BAAI with the support of leading institutions and tech partners. CCI is designed to be the cornerstone of AI research requiring high-quality Chinese language data.

CCI’s standout features:

Vast Scale: CCI offers an impressive 104GB of data, providing a broad spectrum of linguistic information.

Time Span: The dataset encompasses over two decades of data, from January 2001 to November 2023, offering historical depth and contemporary relevance.

Quality Sources: Data is sourced from trusted and authoritative Chinese internet platforms, ensuring high fidelity and relevance.

Rigorous Processing: CCI has undergone extensive cleaning, deduplication, and quality checks to ensure the highest standards of data integrity.

Safe and Reliable: With a focus on safety and reliability, CCI has been filtered through advanced techniques to remove any sensitive or inappropriate content.

Benchmark Filtering: Unique to CCI, we’ve implemented stringent checks against mainstream Chinese benchmark datasets to prevent “teaching to the test” in model training.
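For readers curious what benchmark filtering can look like in practice, here is a minimal sketch of one common approach: dropping any document that shares a long character n-gram with a benchmark item. This post does not describe the actual CCI pipeline, so everything below is an assumption for illustration, including the function names, the n-gram length, and the choice to ignore whitespace.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of character n-grams in text, ignoring whitespace."""
    cleaned = "".join(text.split())
    if len(cleaned) < n:
        return set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[str]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[str] = set()
    for item in benchmark_texts:
        index |= char_ngrams(item, n)
    return index


def is_contaminated(document: str, index: Set[str], n: int) -> bool:
    """True if the document shares at least one n-gram with the benchmark index."""
    return not char_ngrams(document, n).isdisjoint(index)


def filter_corpus(documents: Iterable[str],
                  benchmark_texts: Iterable[str],
                  n: int = 13) -> List[str]:
    """Keep only documents with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in documents if not is_contaminated(doc, index, n)]


# Hypothetical example data, not taken from CCI or any real benchmark.
benchmark = ["中国的首都是哪个城市？北京"]
corpus = [
    "问题：中国的首都是哪个城市？答案见下文。",  # overlaps the benchmark item
    "今天天气很好，我们去公园散步吧。",          # unrelated text
]
clean = filter_corpus(corpus, benchmark, n=8)
```

A shorter n catches more paraphrased leakage but also removes more innocent text, so the threshold is a trade-off to tune on your own data rather than a fixed value.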

Download CCI and join us in shaping the future of AI:

BAAI Open Data Repository: https://data.baai.ac.cn/details/BAAI-CCI

HuggingFace: https://huggingface.co/datasets/BAAI/CCI-Data

We’re eager to see the innovative applications and research that will emerge from the community’s use of CCI. Your participation and feedback are crucial to the continuous improvement of this dataset.

Cheers,

The BAAI Team

Supported by: CSAC, Beijing Municipal Cyberspace Administration, Beijing Municipal Science & Technology Commission, Zhongguancun Administrative Committee, Haidian District Government, and our tech partners TRS and Wenge.

submitted by /u/lukai-baai