Traditional rule-based approaches have largely failed because they lack the ability to understand the context in which sensitive information occurs. This leads not only to false negatives but also to false positives, which generate so much alert noise that security practitioners cannot make use of them. Often, this ends with the DLP solution being switched off altogether, which defeats the purpose of having one in the first place.
We at Borneo are striving to build the next-generation inspection engine, so that our customers can have their sensitive information accurately detected and remediated. To do so, we first need to identify what the engine needs to solve.
Imagine detecting sensitive information as a game that an AI has to solve. How does it know how to solve it?
For one, we can feed it data and let it learn from that data. The problem with this approach is that data is unbounded: theoretically, any data that can possibly be created might be created, and some of it will be completely new to the world. Furthermore, sensitive personal information is very scarce in data systems. This makes it very difficult to gather enough training data for AI/ML models to learn enough context to make predictions on data created in the future.
Instead, we can give the model rules that we know will work. These do not have to be learnt by the model, as they can easily be programmed into it, and they can cover the majority of cases in any data system. The problem, however, is that human-written rules are biased toward their authors' experience: their language, communication style, terminology, short-forms and so on. In other words, rules help the model solve for the known knowns, but beyond that, we need other techniques.
So in brief, known knowns are patterns with familiar contexts, for which rule-based systems can be designed. They are what we deem the easy tasks, and are usually solvable by basic models.
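To make the idea concrete, here is a minimal sketch of a rule-based detector for known knowns. The patterns and infotype names are illustrative examples, not Borneo's production rules:

```python
import re

# Simplified, illustrative patterns for "known knowns": formats whose
# structure is well understood and easy to encode as rules.
KNOWN_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_known_knowns(text: str) -> list[tuple[str, str]]:
    """Return (infotype, match) pairs found by the fixed rules."""
    hits = []
    for infotype, pattern in KNOWN_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((infotype, match.group()))
    return hits

print(detect_known_knowns("Contact jane@example.com, SSN 123-45-6789"))
```

Anything that fits one of these shapes is caught; anything that does not, no matter how sensitive, silently slips through, which is exactly where the harder categories below begin.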
So what else is out there?
Well, what's the opposite of known knowns? Known unknowns! So what about them? They are quite difficult to solve for, because they are the data that we know exist in our systems but for which we do not know the rules to build into the model. They usually turn out to be false negatives for basic models, and they are hard to discover precisely because they never surface in any form of alert.
So how do we solve for known unknowns? Of course, we could pre-label a huge dataset with the model built around known knowns, but where do we get that huge dataset from? Furthermore, most customers are sensitive about data sharing and hence cannot give us full access to their data to train on. So have we hit a roadblock?
In fact, we have already cracked how to solve for them in multiple ways. Seasoned machine learning practitioners will probably be able to speculate based on the paragraph above. We will leave this as a puzzle for you to figure out ;)
So, we have dealt with the known un(known)s (hopefully). What else is missing?
What about unknown unknowns?
Unknown unknowns? What are they even?
Well, as a company scales, evolves, and churns, undeniably not everyone has a complete understanding of what kind of data is collected. This may be exacerbated if the company operates across different data silos. What a headache for compliance! So basically, unknown unknowns are the sensitive data that those tasked with listing out what to detect may not be privy to. They are usually Personal Information (PI) rather than Personally Identifiable Information (PII).
PI data refers to attributes that are directly or indirectly linked to an individual's identity such as a name, location, identification number, online identifier or any physical, genetic, economic, cultural or social feature associated with an individual person.
PII data commonly refers to information that can be used to uniquely identify an individual either directly or when used in conjunction with other forms of information. An individual's date and place of birth, mother's maiden name, social security number, driver's license number, passport number and fingerprints are all considered to be forms of PII data.
From It's Time to Invest in a Privacy Stack, by Mark Settle
Detecting unknown unknowns requires understanding what kind of data is usually PII in a company, which translates into the patterns that have to be detected. For instance, some companies have terminology for internal IDs, such as patient IDs, customer IDs, and employee IDs, which can be used in DB joins to assemble a full set of PI that is ripe for exploitation. (We will dive deeper into how we detect unknown unknowns in structured data in another blog post, so stay tuned.)
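As a hypothetical sketch of why internal IDs are tricky, consider flagging columns whose names suggest a join-able identifier. The terminology list below is invented for illustration; note how a company-specific abbreviation slips straight past it, which is the terminology blind spot described above:

```python
# Hypothetical terminology for internal-ID columns. A real system would
# have to learn this vocabulary per customer rather than hard-code it.
INTERNAL_ID_TERMS = ("patient", "customer", "employee", "member", "account")

def looks_like_internal_id(column_name: str) -> bool:
    """Flag columns whose names suggest a join-able internal identifier."""
    name = column_name.lower().replace("-", "_")
    return name.endswith("id") and any(
        term in name for term in INTERNAL_ID_TERMS
    )

columns = ["patient_id", "created_at", "cust_id", "employee_number", "notes"]
flagged = [c for c in columns if looks_like_internal_id(c)]
# "cust_id" is missed because "cust" is not in the hard-coded vocabulary:
# exactly the kind of unknown unknown a fixed rule set cannot see.
print(flagged)  # ['patient_id']
```

The miss is silent: no alert fires, and the joinable identifier stays undetected until something contextual, rather than lexical, catches it.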
Companies that want complete coverage can use our unknown-unknowns detection solution, which we call Predictive Infotypes, built into the platform itself. It predicts whether certain patterns are likely sensitive, based on contextual features and the other infotypes detected around them.
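To give a flavour of context-based prediction, here is a deliberately simplified toy, not how Predictive Infotypes actually works: score an unrecognized column by how many known-sensitive infotypes were detected in the columns around it.

```python
# Toy illustration of contextual sensitivity scoring. The infotype names
# are illustrative assumptions, not a real taxonomy.
SENSITIVE_NEIGHBOURS = {"email", "name", "ssn", "phone", "address"}

def sensitivity_score(neighbour_infotypes: set[str]) -> float:
    """Fraction of neighbouring columns carrying known-sensitive infotypes."""
    if not neighbour_infotypes:
        return 0.0
    hits = neighbour_infotypes & SENSITIVE_NEIGHBOURS
    return len(hits) / len(neighbour_infotypes)

# An unlabelled column sitting next to name/email/phone columns is far
# more likely to be sensitive than one surrounded by timestamps.
print(sensitivity_score({"name", "email", "phone", "timestamp"}))  # 0.75
print(sensitivity_score({"timestamp", "counter"}))                 # 0.0
```

The real signal set is of course much richer than neighbouring infotypes alone, but the intuition carries: context around a pattern can reveal sensitivity that the pattern itself does not.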
In conclusion, there are three kinds of detection that we need to solve for, in order to have complete PII and PI coverage:
- Known knowns
- Known unknowns
- Unknown unknowns
First-principles thinking is a very useful tool for solving engineering problems. Elon Musk often preaches this form of thinking in his approach to tackling some of the most complex engineering problems of our lifetimes. To learn more about First Principles Thinking, you can visit https://fs.blog/first-principles/.
Thanks for reading! We will have a part 2 on how we approach inspection across different content formats.