
Sensitive Data Discovery and Classification at Scale: Accuracy & Throughput Evaluation

Sponsor: comforte AG

Abstract

comforte AG commissioned Tolly to evaluate comforte SecurDPS Discover and Classify, benchmarking the solution’s accuracy and throughput for large-scale sensitive-data discovery and classification across structured and unstructured data sources. Testing covered relational databases, flat files, image data processed with OCR, and a Microsoft Exchange email environment to assess how effectively the platform identifies sensitive data at scale for security, compliance, governance, and privacy use cases.


The report emphasizes that data visibility is the required first step in protecting sensitive information. Tolly evaluated comforte’s supervised-AI approach using defined root data assets, or RDAs, as trusted data examples to guide subsequent scans. In the structured-data accuracy test, comforte scanned a PostgreSQL database table containing 1,000 records and 100 columns, 38 of which contained personally identifiable information. The solution recognized 37,468 of 38,000 sensitive data elements, delivering 98.6% accuracy. In the unstructured-data accuracy test, a folder containing 200 flat text files with 1,000 total data records was scanned, and comforte detected all 1,000 records for 100% accuracy.  
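
The accuracy figures above follow from simple ratios, which the short Python sketch below reproduces. It assumes that "accuracy" means detected elements divided by expected elements, since the abstract does not say whether false positives were counted. The function and variable names are illustrative and do not come from the report.

```python
# Figures taken from the structured and unstructured accuracy tests above.
RECORDS = 1_000          # rows in the PostgreSQL test table
PII_COLUMNS = 38         # columns containing personally identifiable information
DETECTED = 37_468        # sensitive elements recognized in the structured test


def detection_rate(detected: int, expected: int) -> float:
    """Return detected / expected as a percentage (assumed definition of accuracy)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * detected / expected


# 1,000 records x 38 PII columns = 38,000 expected sensitive elements.
expected_elements = RECORDS * PII_COLUMNS
structured_rate = detection_rate(DETECTED, expected_elements)

# Unstructured test: all 1,000 records in the 200 flat files were detected.
unstructured_rate = detection_rate(1_000, 1_000)

print(f"structured:   {structured_rate:.1f}%")
print(f"unstructured: {unstructured_rate:.1f}%")
```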


Throughput testing was designed to reflect real-world environments at substantial scale. In a PostgreSQL database environment containing 63.2 million rows, 1,000 tables, 9,468 columns, and a total database size of 9.8GB, comforte’s initial “security” classification scan reviewed column names and sample data in 25 seconds, achieving 378 columns per second and 2.5GB per second. In the deeper “privacy” scan, which evaluated all rows to build a searchable index of personal data, the system processed 45,730 rows per second at 0.7MB per second over about 23 minutes.  
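
As a rough cross-check, the column rate and the privacy-scan duration can be recomputed from the totals quoted above. The sketch below is illustrative only and its variable names are assumptions. The byte-rate figures are not re-derived here, because the abstract does not state what data volume they were measured against.

```python
# Totals quoted for the PostgreSQL throughput environment.
TOTAL_COLUMNS = 9_468
SECURITY_SCAN_SECONDS = 25
TOTAL_ROWS = 63_200_000
PRIVACY_ROWS_PER_SECOND = 45_730

# "Security" scan: columns reviewed per second of scan time.
columns_per_second = TOTAL_COLUMNS / SECURITY_SCAN_SECONDS

# "Privacy" scan: time needed to read every row at the reported row rate.
privacy_scan_minutes = TOTAL_ROWS / PRIVACY_ROWS_PER_SECOND / 60

print(f"security scan: {columns_per_second:.0f} columns per second")
print(f"privacy scan:  {privacy_scan_minutes:.0f} minutes")
```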


Tolly also measured large-scale scanning of unstructured repositories. In a file-share test covering roughly 754,858 files and about 0.97TB of data, a serial scan processed about 8 files per second at 10.23MB per second, while parallel scanning of ten 100GB repositories raised throughput to 76.93MB per second and reduced total runtime to 3.5 hours. In OCR testing, the solution processed 1,000 images at 286 images per hour in serial mode and 2.64MB per second in parallel mode. In a Microsoft Exchange test with 38,000 emails, including 1,000 with personal data in PDF attachments, comforte scanned at 1.5 emails per second. Overall, the report presents comforte SecurDPS Discover and Classify as an accurate and scalable platform for discovering and classifying sensitive data across diverse enterprise repositories.
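
The rates above imply approximate runtimes that the abstract does not state directly. The following sketch derives them under the assumption that each reported rate held steady for the whole run. The resulting durations are estimates implied by the quoted figures, not measurements from the report, and the helper name is hypothetical.

```python
def hours_to_process(items: float, items_per_second: float) -> float:
    """Estimate runtime in hours, assuming a constant processing rate."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return items / items_per_second / 3600


# Serial file-share scan: 754,858 files at about 8 files per second.
serial_file_hours = hours_to_process(754_858, 8)

# Serial OCR: 1,000 images at 286 images per hour.
ocr_serial_hours = 1_000 / 286

# Exchange scan: 38,000 emails at 1.5 emails per second.
exchange_hours = hours_to_process(38_000, 1.5)

# Parallel versus serial file-share throughput (both in MB per second).
parallel_speedup = 76.93 / 10.23

print(f"serial file scan: ~{serial_file_hours:.1f} h")
print(f"serial OCR:       ~{ocr_serial_hours:.1f} h")
print(f"Exchange scan:    ~{exchange_hours:.1f} h")
print(f"parallel speedup: ~{parallel_speedup:.1f}x")
```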