By Daniel Suissa, Software Engineer at Sentra
Source code lies at the heart of every technology company’s business. Aside from being the very blueprint upon which the organization relies upon to sell its products, source code can reveal how the business operates, its strategies, and how its infrastructure is designed. Many of the recent data breaches we’ve witnessed, including those against industry leaders like LastPass, Okta, Intel, and Samsung, were instances where attackers were able to gain access to all or part of the organization's source code.
The good news with source code is that we usually know where it originated from, and even where it’s destined to be shipped. The bad news is that code delivery is getting increasingly more complex in order to meet business demands for fast iterations, causing code to pass multiple stations on its way to its final destination. We like to think that the tools we use to ship code protect it well and clean it up where it's no longer needed, but that’s wishful thinking that puts the business at risk. To make matters worse, bad development practices can lead to developer secrets and even customer information being stolen with a source code breach, which can in turn trigger cascading problems.
At Sentra, we see protecting source code as the heart of protecting an organization’s data. Simply put, code is a qualitative type of data, which means that unlike quantitative data, the impact of the breach does not depend on its scale. Even a small breach can provide the attacker with crucial intellectual property or intel that can be used for follow up attacks. That said, not every piece of code leaked can damage the business in the same way.
So how do we protect source code in the cloud?
All data protection starts with knowing where the data is and how it’s protected. We always start with the home repository, usually in GitLab, GitHub, or BitBucket. Then we move to data stores that are a part of the code delivery cycle. These can be container-management services like Amazon’s Elastic Containers Service or Azure Container Instances, as well as the VMs running that code. But because code is also used by developers on personal VMs and moved through Data Lakes, Sentra takes a wider approach and looks for source code across all of the organizations’ non-tabular data stores across all IaaS and SaaS services, such as files in Azure Disk Storage volumes attached to Azure VMs.
We said it before and we’ll say it again - not all data is created equal. Some copies of source code may include intellectual property and some may not. For example, a CPP file with complex logic is not the same as an HTML file distributed by a CDN. On the other hand, that HTML might accidentally contain a developer secret, so we must look for those as well before we label it as ‘non-sensitive’. Classifying exactly what kind of data each particular source code file contains helps us filter out the noise and focus on the most sensitive data.
Detecting Data Movement
At this point we may know where source code is located and what kind of information it contains, but not how where it came from or how to stop bad data flows that lead to unwanted exposure. Remember, source code is handled both manually and by automatic processes. Sometimes it’s copied in its entirety, and sometimes partially. Detecting how much is copied and through which processes will help us enforce good code handling practices in the organization. Sentra combines multiple methods to identify source code movement at the function level by understanding the organization’s user access scheme, activity, and by looking at the code itself.
Security efficiency begins with prioritization. Some of the code we will find in the environment may be properly separated from the world behind a private network, or even encrypted, and some of it may be partially exposed or even publicly accessible. By determining the Data Security Posture of each piece of code we can determine what processes are conducive to the business’ goals and which put it at risk. This is where we combine all of the above steps and determine the risk based on the kind of data in the code, how it is moved, who has access to it, and how well it’s protected.
Now that we understand what source code needs protecting against which risks, and more importantly, what are processes which require the code in each store, we can choose from several remediation tools in our arsenal:
- Encrypt. Often source code is not required to be loaded from rest very-quickly, so it’s alway a good idea to encrypt or obfuscate it.
- Limiting access to all stores other than the source code repository.
- Use a retention policy anywhere where the code is needed only intermediately.
- Review old code delivery processes that are no longer needed.
- Remove any shadow data. Code inside unused VMs or old stores that weren't accessed in a while can most probably be removed altogether.
- Detect and remove any secrets in source code and move them to vaults.
- Detect intellectual property that is used in non-compliant or insecure environments.
Source code is the data that absolutely cannot be allowed to leak. By taking the steps above, Sentra’s DSPM ensures that it stays where it’s supposed to be, and always is protected properly.