🛰️ Grably Global Dataset
Version: 2.0.7 | Provider: Grably Data | Updated: 2025-12-26
A structured dataset of public Telegram content spanning text, image, video, and audio modalities.
- Messages: +200M /mo
- Tokens: +12B words /mo
- Images: +450M /mo
- Videos: +150k /mo
- Duration: +12M hrs /mo
Global Discovery Hub
- Total Chats: 1.2M
- Total Channels: 850k
- Total Users: 450M
Primary Use Cases
- 🚀 Large-scale LLM pretraining
- 🚀 Multimodal model training (vision, audio, video)
- 🚀 Safety & alignment research (reactions/growth)
- 🚀 Trend detection & topic modeling
- 🚀 Technical question answering (e.g. software engineering)
- 🚀 Real-time streaming context
Sentiment & Engagement
[Chart: top reaction distribution, normalized to billions]
Aligning models to high-engagement signals is a core use case, supported by the reaction and growth metadata in this dataset.
- Global Nodes: Active
- S3 Storage: Distributed
- Live Streamers: Concurrent
Data Provenance Flow
**Public Telegram URL** ➡️ **Rendered Public Page** ➡️ **Parsed Content** ➡️ **Privacy Minimization** ➡️ **Modality-Specific Storage** ➡️ **Buyer-Scoped Delivery**
Delivery formats: Parquet, TSV, JSONL
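As a rough illustration of consuming two of the listed formats, the sketch below reads JSONL and TSV files with only the Python standard library. The function name `iter_records` and the idea of dispatching on file extension are assumptions made for this example, not part of any documented Grably interface, and no field names are assumed. Parquet is left out because it needs a third-party reader.

```python
import csv
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per record from a .jsonl or .tsv file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    with open(path, encoding="utf-8", newline="") as handle:
        if suffix == ".jsonl":
            # One JSON object per line; blank lines are skipped.
            for line in handle:
                if line.strip():
                    yield json.loads(line)
        elif suffix == ".tsv":
            # The first row is treated as the header row.
            yield from csv.DictReader(handle, delimiter="\t")
        else:
            raise ValueError(f"unsupported format: {suffix}")
```

Streaming records one at a time keeps memory use flat, which matters at the monthly volumes quoted here.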
Monthly New Data Inflow
- Text: 12B words / 200M msgs
- Images: 450M
- Video: 12M / 150k hrs
- Audio: 8.5M
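The text figures above imply an average message length, which is useful when estimating token budgets. The short calculation below works it out from the stated 12B words and 200M messages per month. The annualized totals assume the monthly rate stays constant, which the source does not claim.

```python
# Monthly text inflow as stated in the figures above.
WORDS_PER_MONTH = 12_000_000_000    # 12B words
MESSAGES_PER_MONTH = 200_000_000    # 200M messages

# Average message length implied by the two figures.
words_per_message = WORDS_PER_MONTH / MESSAGES_PER_MONTH

# Assumption: a constant monthly rate over twelve months.
words_per_year = WORDS_PER_MONTH * 12
messages_per_year = MESSAGES_PER_MONTH * 12

print(f"{words_per_message:.0f} words per message")            # 60
print(f"{words_per_year / 1e9:.0f}B words per year")           # 144
print(f"{messages_per_year / 1e9:.1f}B messages per year")     # 2.4
```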
Sourcing & Privacy Positioning
Data is sourced only from publicly accessible Telegram channels and groups. No private or restricted content is included.
PII Minimization Strategy:
- 🛡️ User IDs → hashed
- 🛡️ Group IDs → hashed
- 🛡️ Usernames → removed/randomized
- 🛡️ Emails & phone numbers → removed
- 🛡️ No IP addresses
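The steps in this list can be sketched as a single record transform. This is a minimal illustration, not Grably's actual pipeline: the record fields (`user_id`, `group_id`, `username`, `ip`, `text`), the key, the use of HMAC-SHA256, and the regular expressions are all assumptions, and the patterns are deliberately simple rather than exhaustive. The sketch takes the "removed" option for usernames.

```python
import hashlib
import hmac
import re

# Hypothetical secret; a real deployment would load this from a secrets manager.
SECRET_KEY = b"example-key-not-from-the-dataset"

# Illustrative patterns only; production detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def keyed_hash(value: str) -> str:
    """Return a hex HMAC-SHA256 digest of value under SECRET_KEY."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record: dict) -> dict:
    """Apply the listed PII steps to one message record (hypothetical schema)."""
    text = EMAIL_RE.sub("[removed]", record["text"])
    text = PHONE_RE.sub("[removed]", text)
    # Only hashed IDs and scrubbed text are copied over, so the username
    # and IP address fields never reach the output.
    return {
        "user_hash": keyed_hash(f"user:{record['user_id']}"),
        "group_hash": keyed_hash(f"group:{record['group_id']}"),
        "text": text,
    }
```

Building the output from an explicit allowlist of fields, rather than deleting known PII fields, means an unexpected new field is dropped by default.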
🔒 Anonymization Modes
- Global User Hash: the same user is recognizable across the entire dataset.
- Chat-Scoped User Hash: the same user is recognizable only within a single group.
- Chat + Time-Bound Hash: a user's hash changes every month, even within the same group.
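One way to picture the three modes is as the same keyed hash applied to progressively narrower scopes. The sketch below is an assumption about how such a scheme could work, not a description of Grably's implementation: the function names, the key, the HMAC-SHA256 choice, and the use of the UTC calendar month as the rotation boundary are all hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical secret key; not part of the dataset.
SECRET_KEY = b"example-key-not-from-the-dataset"


def _digest(*parts: str) -> str:
    """HMAC-SHA256 over the parts, joined with a separator that IDs should not contain."""
    msg = "\x1f".join(parts).encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


def global_user_hash(user_id: str) -> str:
    """Same user maps to the same hash everywhere in the dataset."""
    return _digest("global", user_id)


def chat_scoped_user_hash(user_id: str, chat_id: str) -> str:
    """Same user maps to a different hash in each chat."""
    return _digest("chat", chat_id, user_id)


def chat_time_bound_hash(user_id: str, chat_id: str, sent_at: datetime) -> str:
    """Same user maps to a new hash each month within one chat.

    Assumes sent_at is timezone-aware; the month is taken in UTC.
    """
    month = sent_at.astimezone(timezone.utc).strftime("%Y-%m")
    return _digest("chat-month", chat_id, month, user_id)
```

The narrower the scope, the less cross-context linkage a buyer can reconstruct, at the cost of losing longitudinal signals such as a user's activity across groups or months.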
Compliance Strategy
Designed to facilitate compliance assessments under common enterprise governance frameworks. Written sourcing and compliance attestations are available upon request.