Automated PII Catalog powered by DataHub for real-time sensitive data observability

Hemant Kumar, 4/5/2023, 7 Min Read

And how it helps Privacy Teams, Data Engineers, and Data Owners solve compliance gaps


Achieve data governance at scale with the joint "Automated Governance Catalog" from Borneo and Acryl Data (DataHub). It is a DataOps-led governance solution for fast-moving data teams that combines real-time privacy observability with metadata-orchestrated business workflows.


To achieve Governance at scale in the modern data stack, there are a few fundamental prerequisites:


  • Understanding what data lives where along with lineage information about how the data got there
  • Semantic categorization of data based on Business terminology and classification based on Governance/Compliance requirements
  • Automated data management using semantic labels with appropriate human-in-the-loop workflows

Given the scale and diversity of the modern data stack, categorizing data through a manual process of tagging certain datasets does not scale. Data pipelines are constantly evolving, replicating and morphing data into different formats and locations. A manual, tedious process of trying to keep up with all the changes simply burdens the team and creates a huge risk of major datasets going out of compliance.


A quick data catalog primer

A modern data catalog lets you build a complete data graph by collecting technical, business and operational metadata across many data sources, including APIs, datasets, pipelines, dashboards, AI models, and features. Acryl Data, powered by the open source project DataHub, enables three main use cases:


  1. Data discovery
  2. Automated data governance
  3. Data Observability


How is DataHub used by Privacy Engineers?

  1. Understand the privacy structure and remain compliant with ever-changing data privacy regulations.
  2. Facilitate data discovery, understanding and reuse through features like business glossary.
  3. Support data sharing and collaboration.


What is the missing piece?

The real work begins after all those thousands of resources have been discovered by the Data Catalog system in your account. The process of manually inspecting the dataset's metadata and tagging it with standardized terms from a business glossary is not only inefficient but can also be inaccurate. Automation is key here with the appropriate human-in-the-loop workflows for approvals.


Current Process and Problems

  • The current process relies on someone manually looking at the metadata and tagging it with a term defined in the catalog system.
  • In many cases, this takes many hundreds of hours of effort in a mid to large sized company.
  • Looking at just the metadata is NOT enough. The field name could be as generic as event_data and it might contain millions of email addresses or even credit card numbers (Developers love logging stuff, don't they?).
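
To make the last point concrete, here is a minimal sketch of content-based detection on the values of a generically named field such as event_data. It is an illustration only, not Borneo's detection logic: the regular expressions, the Luhn checksum filter, and the names EMAIL_RE, CARD_RE, luhn_valid and detect_info_types are all assumptions made for this example.

```python
import re

# Illustrative patterns only; a production scanner would use far more
# robust detectors than two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_info_types(value: str) -> set:
    """Return the set of (hypothetical) info-type labels found in a value."""
    found = set()
    if EMAIL_RE.search(value):
        found.add("EMAIL")
    for match in CARD_RE.finditer(value):
        digits = re.sub(r"\D", "", match.group())
        # The Luhn check filters out most random digit runs (order IDs etc.).
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            found.add("CREDIT_CARD")
    return found
```

Running detect_info_types over a sample of rows, rather than over the column name, is what surfaces an email address hidden in a field whose name says nothing about its contents.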


What is Borneo's Data Discovery?

In simple words, Borneo looks at the underlying data in your datasets and detects all the sensitive info-types (or even custom ones specific to your company). Borneo automatically inspects the actual data present in RDS, Presto tables, Redshift, S3 buckets and many other sources, using Machine Learning and some voodoo magic to identify the info-types present in those resources.

Using these results, you can quickly get a full picture of your data and its type. Is there a public S3 bucket containing credit card numbers or SSNs? That is a huge red flag, and Borneo will tell you about it. But that's more of a side quest for the scope of this post; you can pursue it here.

(Detecting Sensitive Data Across multiple SaaS and Cloud resources)


Now, what is the Automated Governance Catalog?

Right now, we have these two separate services:


  • Borneo: Scans for sensitive data present in your cloud resources.
  • Acryl Data: Scans your cloud infrastructure's metadata and makes it available in a catalog that you can manually tag with terms.


Suppose these two services interoperated seamlessly, giving you a way to automatically tag every resource in the catalog with the sensitive data found in it, using business glossary terms defined in the Data Catalog. Further, humans should be able to spot-check and approve the automated proposals.

That's what we did.

Whatever Borneo finds in its real-time and scheduled data scanning will now be automatically pushed to Acryl, saving you hundreds of hours of manual work.
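
The flow described above (scanner findings become glossary-term proposals that a human can approve) can be sketched in a few lines. This is a hedged illustration with plain Python objects, not the actual Borneo or Acryl integration: the finding format, the INFO_TYPE_TO_TERM mapping, the glossary term names, and the TermProposal, build_proposals and approve helpers are all hypothetical.

```python
from dataclasses import dataclass, replace

# Hypothetical mapping from scanner info-types to glossary term URNs.
# The term names are made up for illustration.
INFO_TYPE_TO_TERM = {
    "EMAIL": "urn:li:glossaryTerm:Classification.Email",
    "CREDIT_CARD": "urn:li:glossaryTerm:Classification.CreditCard",
}


@dataclass(frozen=True)
class TermProposal:
    dataset: str
    field: str
    term_urn: str
    status: str = "PENDING"  # waits for a human reviewer


def build_proposals(dataset, findings):
    """Turn scanner findings into pending glossary-term proposals."""
    proposals = []
    for finding in findings:
        term = INFO_TYPE_TO_TERM.get(finding["info_type"])
        if term is None:
            continue  # no glossary term defined for this info-type
        proposals.append(TermProposal(dataset, finding["field"], term))
    return proposals


def approve(proposal):
    """Return a copy of the proposal marked as approved by a reviewer."""
    return replace(proposal, status="APPROVED")
```

Keeping every proposal in a PENDING state until a reviewer approves it is one simple way to model the human-in-the-loop step; the real workflow may differ.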


Benefits of a Smart Catalog


A catalog based on metadata inspection alone will result in missing 33% of sensitive information


  • You don't have to rely on the metadata to identify what kind of data is present in the system.
  • It's all automated and happens in real time. No manual labor is required to go through vague field names and tag the resources with the correct terms. We do that even when you sleep.
  • Reduced scope for error. A manual process is tedious, and it's easy to miss unstructured fields such as a column of type Raw Text or JSON.
  • No ongoing work is required for resources newly added to the system. They're automatically detected by both Acryl and Borneo, and are tagged as soon as anything is detected.

(Sensitive privacy related Info-types detected by Borneo)


What differentiates us from others?


Open ecosystem and DataHub community-led product development

  • Having complete control over your metadata by avoiding vendor lock-in is important. It will always be possible to extract core metadata from the Acryl offering into open-source DataHub.
  • Data Discovery, Data Governance, Data Observability are not solved by software alone. The large DataHub community is generating best practices about how to improve ownership, how to achieve compliance outcomes etc. and the product is continuously evolving to reflect the learnings of a large community of data practitioners.


Compliance monitoring and active data management


  • Acryl DataHub allows defining compliance constraints at the dataset or column level (e.g., mandatory presence of glossary terms from a certain compliance taxonomy). Whether the constraints are met gives you a simple test of whether a dataset is in compliance.
  • Human-in-the-loop approval workflows make it easy to plug in ML-based classifiers with sufficient safety.
  • Metadata analytics allow you to monitor datasets that are out of compliance at a domain, platform, or team level.
  • Automated actions can be triggered in response to key events (e.g., a PII term was attached to a column, a schema changed) to perform activities like retention, GDPR deletion, etc.
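
The first bullet above, a constraint requiring every column to carry a glossary term from a compliance taxonomy, reduces to a simple set check. The sketch below is a plain-Python illustration under assumed names (check_columns, dataset_compliant, and the sample taxonomy); it does not use or represent the Acryl DataHub constraint API.

```python
def check_columns(column_terms, taxonomy):
    """Map each column to True if it carries at least one taxonomy term."""
    return {col: bool(terms & taxonomy) for col, terms in column_terms.items()}


def dataset_compliant(column_terms, taxonomy):
    """A dataset is compliant only if every column meets the constraint."""
    return all(check_columns(column_terms, taxonomy).values())


# Hypothetical taxonomy and column tags for a table like logging_events.
TAXONOMY = {"Sensitive", "NonSensitive"}
COLUMNS = {
    "event_id": {"NonSensitive"},
    "event_data": set(),  # not yet classified
}
```

With the sample data, check_columns(COLUMNS, TAXONOMY) flags event_data as not meeting the constraint, so the dataset as a whole is out of compliance until that column is classified.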


Proven platform capabilities of DataHub architecture



Want to see it in Action?

Let's take this Hive table and see what it looks like in the catalog without the Borneo integration.

(A table containing log events on Acryl Dashboard prior to Borneo Integration)


Notice that there are no terms right now in the rightmost column, and you'd have to manually tag this table with the correct terms by looking at the field names.

What tags would you choose for the event_data field? Looks like a generic field that could contain a JSON object, which could be anything, correct? This scenario occurs more and more when you're dealing with Big Data and have multiple streams pouring into a single dataset.

Let's take a look at the data in this table.

(The raw data in the table logging_events)


Notice something? In the JSON value of the field there's a nested object called user containing really sensitive information. This nested object might not be present on all the rows of this table and a person checking this data manually could easily miss it, but Borneo won't.
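
To illustrate why a nested object that appears on only some rows is easy to miss by hand, here is a minimal sketch that walks each JSON value and reports the paths where an email-like string occurs. The sample rows, the single email pattern, and the names email_paths and scan_rows are assumptions for this example; they do not reflect the actual table contents or Borneo's scanner.

```python
import json
import re

# Illustrative email pattern; real detectors cover many more info-types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def email_paths(node, prefix=""):
    """Recursively collect dotted paths whose string value looks like an email."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.extend(email_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            paths.extend(email_paths(item, prefix))
    elif isinstance(node, str) and EMAIL_RE.search(node):
        paths.append(prefix)
    return paths


def scan_rows(rows):
    """Scan every row, since the nested object may exist on only a few."""
    found = set()
    for raw in rows:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip values that are not valid JSON
        found.update(email_paths(doc))
    return sorted(found)


# Hypothetical event_data values; only the second row has a user object.
ROWS = [
    '{"action": "click", "target": "button"}',
    '{"action": "signup", "user": {"name": "Jane", "email": "jane@example.com"}}',
]
```

Because every row is inspected, the user.email path is reported even though most rows do not contain a user object at all.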

Now let's take a look at how the catalog looks after the integration with Borneo.

(The table automatically tagged with relevant terms after integrating Borneo)


As is evident from this image, Borneo has discovered the correct terms for this field and tagged the fields with them automatically.

Ready for a test drive? Reach out to us for a demo!




What is Borneo?

Borneo helps security & privacy teams achieve continuous compliance and data protection through accurate & actionable data discovery.

Want to watch Borneo in action? Request a demo here and we will get back to you as soon as possible.
