Alex F

solve et coagula

UX/IA/Research/Strategy/InfoSec

Musician

WARATEK MANAGEMENT CONSOLE

 

Waratek has unique technology surrounding Java and .Net application security. The high-level architecture consists of Agents on endpoints and a Management Console (MC) that manages the Agents, supplies them with policies and rules, etc. When I joined, it was immediately apparent that there were a number of problems within the Management Console.

Problem 1 - Rules:

When I started at Waratek they were in the middle of a transition. The MC was at version 2.x, and there were 2 types of rules: “Version 1 rules” (“v1 rules”), and ARMR version 1.x rules. ARMR is a unique proprietary language that allows a Waratek Agent to alter the behavior of any Java or .Net application in real time, in memory. Rules that are “exploit-aware” (generally in the form of being tailored around CWEs and CVEs) can alter an application’s behavior in real time and prevent attacks from succeeding. Where firewalls and other protections fail, Waratek’s agents succeed. The v1 Rules were old. There were only 11 or 12. They were not very flexible at all. They were difficult to understand, as you will see. Just looking at them, you had no idea what they did. And the next generation of Agents that were being created no longer used the v1 Rules.

The MC also utilized ARMR v1.x rules. Sort of. You could click on something in the UI to add or edit an ARMR rule, but all you got was a small black box in a slide-out that was supposed to be a tiny little window where the user could type in ARMR code, but it was really just a black square. Even if it did work, the big catch here was that the user had to hand-type the code for the rules. The user was expected to, upon purchase, either thoroughly learn the ARMR language, or have the client satisfaction team custom-write the rules for them. It was, to say the least, not a tenable situation. It was difficult, not intuitive, and put an unreasonable burden on the user to learn the ARMR scripting language (YAML-based).

Problem 2 - Scalability

The architecture for the agents worked like this:

Agents (at that time called “Instances,” not Agents) were grouped together into groups known as Applications (the nomenclature was also confusing, obviously). Those Waratek Applications had rules written for them. You could write a number of rules, apply them to a Waratek Application, and those rules would then get distributed to the Agents (Instances) grouped together in that Waratek Application. If you had another Waratek Application, you had to write all of the rules for that Application from scratch. Whitelisting was not easily achievable in ARMR v1, if at all, and not possible in the v1 Rules. If you wanted to protect all of the files within a directory except 1 or 2, you had to write rules for every file in the directory except the one or two that you needed to exempt.

This is obviously not scalable at all. Nothing in the MC was “reusable” in any real way.

Problem 3 - Policy and Rule Management and Creation

Building on the issues surrounding scalability, if you had 2 Applications, or even 10, where the rules were similar, with the exception of a few slight differences (rule options on or off, slight changes in local paths, etc.), you could not simply copy the rules from one Waratek Application to another and make the necessary changes. You had to start from zero. Rules of either kind were each their own entity. In other words, there was no concept of a policy, or group of rules. Rule creation was complex and time-consuming, and management was a nightmare, which at times led customers to write very simplified versions of a more complex ruleset, and so risk creating rules that were either too broad or too narrow.

ARMR 2.x, coming from the development side, aimed to fix that. You had Mods, which were groups of rules, usually grouped together to serve a single function (more on that later), the language was expanded to include greater whitelisting capabilities, and better protection. ARMR 2.x was looking to be a huge leap forward, but the MC had no real way to deal with it other than provide a text window in which the user could type in the ARMR code that they would have to learn. This was still not a tenable solution.

Problem 4 - Workflows

There were no real logical workflows. Almost everything was done from slide-out drawers and you had to be on the right screen to get the right slide-out drawer. For the most part, there was no way to start broad and drill down into an event. The main page had “welcome items” and checkboxes that never went away. The events graph was practically vestigial, and the events table contained all events, not just security events. It included when agents went online or offline, when new rules were applied, when rules were edited, and everything else that you can think of. It was basically a dumping ground for event i/o. No personas were defined. It was very clear that the MC team had never even spoken to their intended customers, and the development team for the MC had no experience in security, so they couldn’t even guess at what their users might find important or not. It was all a crapshoot.

Problem 5 - The look and feel

The look and feel was haphazard and looked like a web-based Android app from the early days of Android. Large squares designed to bring your attention to unimportant things, very little consistency, tables that looked like they were part of the page with no title bar and sometimes no column headers. Compared to the competition in the marketplace, it was light years behind.

 

Solutions:

My work was obviously cut out for me. It was clear that the following things needed to change:

  • We needed to talk with customers. We needed to find out who the people using the MC were, if there were different types, and what their concerns and needs were, for starters. We also needed to get their input on features, issues with the UI, etc.

  • Noise needed to be reduced in the Events Table. ONLY security events should be in that table. More “administrative” type events needed to be in more appropriate places in the UI.

  • Workflows needed to be completely redone. The workflows should be logical and follow a simple pattern so that a user can get to and do whatever they need in the MC in an intuitive fashion. If the user wanted to turn off a rule they shouldn’t have to figure out how to navigate to the page where the rule slide-out drawer was, and then slide out that drawer. Navigating should be mostly natural and intuitive. It should also follow typical event triage mental models for those that needed to dive deeper into security events.

  • Rules needed to be scalable. We needed a policy system for grouping rules, and the ability to copy and change policies, as well as import and export whole policies. We needed a way to import and export rules within a policy as well. Rules and policies should live independently from the Applications to which they were assigned. Assignments can change, and the effects of that need to be minimal, not monumental. A single policy should be able to be assigned to multiple applications, if necessary, without the need for recreating it from scratch.

  • With the removal of v1 rules and ARMR v1.x rules, and having everything replaced with ARMR 2.x, it was unreasonable to expect the users to buy a product, force them to learn a new language just to create a rule, and then expect happy customers. There needed to be an easy and simple way in the UI to create rules, even if they were fairly complex: The users shouldn’t have to learn a single line of code!

  • Nomenclatures needed to change as well. If everyone in the world is calling agents “Agents,” then why did we insist upon calling them “Instances?” Call them what they are. Call them what the security community knows and is used to. The same goes for “Applications.” The fact that you hear people say things like “You need to create a Policy, assign it to an Application in order to protect your application” is ridiculous. A Waratek Application is an Agent Group, so why not just call it that? It is easily understood and simple. This particular one, however, is a battle that I have not yet won. These days agents are called Agents and not Instances, but Applications are not yet called Agent Groups, and for new users, it is still confusing.

  • The UI needed to be brought into the 21st Century. We needed a real dashboard, and it needed to look like a professional UI. It needed to be on par with the competition. Contrast Security, for example, had an excellent modern UI, even if it is filled with noise and could use some trimming.

 

Workflows

After conducting a number of interviews and realizing that the majority of users fell under only 2 or 3 types of user (the SOC Analyst, and the Application Security Administrator), it was clear that the actual security event flows needed to follow the mental model that a Security Operations Center Analyst would expect. Security related items shouldn’t be on the same screen as application setup items and vice-versa, for example. Endless drawers were not the answer either. Here are examples of how the application started, and the direction that I proposed after discovery and analysis:

As you can see, the proposed workflow is an order of magnitude simpler. It follows a logical broad-to-detailed drill-down path, fits SOC analysts’ mental models of how triage is to be performed, and is very intuitive. No more “wait, how do I get to that screen?”

It also adds in the policy model for scalability and a rules wizard for ease of rule creation (more on that later).

You can see from the current 5.x workflow that my proposed 4.x workflow has remained largely intact, after 2 years of customer and stakeholder feedback.

 

Scalability, Policies, and Rule Creation and Management

I’m combining problems 1, 2, and 3 together here because they are inter-related, and it is less useful to discuss the solutions in isolation.

The short of it is as follows:

  • Group Rules and Mods together into a single container known as a Policy

  • Create a policy system by which policies live independently of Waratek Applications, and thus a single policy can be assigned to or unassigned from multiple Waratek Applications (agent groups) in a one to many relationship as opposed to a one to one relationship, thus increasing scalability

  • Allow policies to be copied so that there is no need to create new policies from scratch if they are to contain similar rules, thus greatly increasing the manageability of rules and policies

  • Allow a single Rule or Mod to be copied from one policy to another (or several), allowing the full re-use of all policy components (2021Q3), adding further to the ease of policy and rule management

  • Dumping v1 Rules and ARMR v1.x rules and replacing them with ARMR 2.x rules, getting rid of the code-entry window and the requirements that users learn a new scripting language by creating a Rule Wizard
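The policy model summarized in the list above can be sketched in a few lines. This is only an illustration and not the MC's actual data model: the class names (`Rule`, `Mod`, `Policy`, `AgentGroup`), their fields, and the choice of one policy per agent group are all assumptions made for the example.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    enabled: bool = True


@dataclass
class Mod:
    name: str
    rules: list[Rule] = field(default_factory=list)


@dataclass
class Policy:
    name: str
    mods: list[Mod] = field(default_factory=list)

    def clone(self, new_name: str) -> Policy:
        """Deep-copy the policy so edits to the copy never touch the original."""
        duplicate = copy.deepcopy(self)
        duplicate.name = new_name
        return duplicate


@dataclass
class AgentGroup:  # a "Waratek Application" in the product's nomenclature
    name: str
    policy: Policy | None = None  # assumption: one policy per group

    def assign(self, policy: Policy) -> None:
        # The group only holds a reference; the policy lives on its own.
        self.policy = policy

    def unassign(self) -> None:
        self.policy = None


# One policy shared by many groups, plus a tweaked copy for a third group.
baseline = Policy("baseline", [Mod("sql", [Rule("block-sqli")])])
payments, reporting, sandbox = (AgentGroup(n) for n in ("payments", "reporting", "sandbox"))
payments.assign(baseline)
reporting.assign(baseline)

relaxed = baseline.clone("baseline-relaxed")
relaxed.mods[0].rules[0].enabled = False
sandbox.assign(relaxed)
```

Because groups hold a reference rather than their own rule set, changing an assignment is a one-line operation, and the cloned policy can diverge without affecting the groups still on the original.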

A short note about rule structure. You have individual Rules, and you have Mods (short for modification). A Mod is a container for rules. While technically you could create an entire policy within a single Mod and export or import it everywhere, that isn’t really the purpose of a Mod. One Mod use case might be to contain all of the rules for a specific feature. A common use would be to create a rule denying access to the /usr/local/bin directory, create whitelist rules that ALLOW access to the files /usr/local/bin/text.txt and /usr/local/bin/Public.txt, and then group all of those rules together into a single Mod. Every rule must be contained within a Mod, even if it is one by itself. This way policies are groups of Mods, which are, in turn, groups of Rules. While this could be confusing, the user can view the policy without looking at the Mods as seen in the picture on the far right, above.
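To make the deny-plus-whitelist Mod concrete, here is a small sketch of how such a group of rules could be evaluated. It is not ARMR and does not describe how a Waratek Agent resolves rules: the `FileRule` type, the `is_allowed` function, and the "most specific path wins" precedence are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRule:
    action: str  # "deny" or "allow" (hypothetical values)
    path: str    # a directory (covers everything beneath it) or a single file


def matches(rule: FileRule, target: str) -> bool:
    """True if the rule's path is the target itself or one of its parent directories."""
    rule_path = PurePosixPath(rule.path)
    target_path = PurePosixPath(target)
    return target_path == rule_path or rule_path in target_path.parents


def is_allowed(mod: list[FileRule], target: str) -> bool:
    """Assumed precedence: the matching rule with the longest path decides."""
    hits = [rule for rule in mod if matches(rule, target)]
    if not hits:
        return True  # nothing in this Mod applies to the target
    winner = max(hits, key=lambda rule: len(PurePosixPath(rule.path).parts))
    return winner.action == "allow"


# The Mod described above: deny the directory, allow two files inside it.
bin_mod = [
    FileRule("deny", "/usr/local/bin"),
    FileRule("allow", "/usr/local/bin/text.txt"),
    FileRule("allow", "/usr/local/bin/Public.txt"),
]
```

With three rules the whole directory is covered, instead of one rule per protected file as the v1 approach required.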

The previous screens show how easy it is to create a rule by simply filling in a form that will generate the ARMR code for them. Users are no longer required to learn the language and type in something such as the following:

app("SQL controls for Oracle DB"):
    requires(version: "ARMR/2.5")
    sql("controls for SQL operations"):
        vendor(oracle, options: [no-backslash-escapes])
        input(database, deserialization)
        injection(failed-attempt, successful-attempt, permit: query-provided)
        protect(message: "denying sql injections", severity: 10)
    endsql
endapp

It’s much easier for an Application Security Administrator in charge of creating the policies to simply point and click. Rules are easily edited in this interface as well, and even if they know the code, this prevents human error (forgetting a colon, for example), and takes only a few seconds as opposed to the much longer time it would take someone to type out the code, look up the syntax, etc.

This was a huge leap forward in usability for the Management Console.
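The idea behind the Rule Wizard, a form whose values are turned into ARMR text, can be sketched as follows. This is not the MC's implementation: `SqlRuleForm`, its field names, and `render_armr` are hypothetical, and the output simply mirrors the shape of the SQL rule shown above rather than claiming to cover the ARMR language.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SqlRuleForm:
    """Values a user might pick in a form; all field names are hypothetical."""
    app_name: str
    rule_name: str
    vendor: str
    permit: str
    message: str
    severity: int
    vendor_options: list[str] = field(default_factory=list)
    inputs: list[str] = field(default_factory=list)
    injection: list[str] = field(default_factory=list)
    armr_version: str = "ARMR/2.5"


def render_armr(form: SqlRuleForm) -> str:
    """Build the rule text from form values so the user never types syntax."""
    for text in (form.app_name, form.rule_name, form.message):
        if not text or '"' in text:
            raise ValueError(f"invalid quoted value: {text!r}")
    injection_args = ", ".join([*form.injection, f"permit: {form.permit}"])
    lines = [
        f'app("{form.app_name}"):',
        f'    requires(version: "{form.armr_version}")',
        f'    sql("{form.rule_name}"):',
        f'        vendor({form.vendor}, options: [{", ".join(form.vendor_options)}])',
        f'        input({", ".join(form.inputs)})',
        f'        injection({injection_args})',
        f'        protect(message: "{form.message}", severity: {form.severity})',
        '    endsql',
        'endapp',
    ]
    return "\n".join(lines)


form = SqlRuleForm(
    app_name="SQL controls for Oracle DB",
    rule_name="controls for SQL operations",
    vendor="oracle",
    permit="query-provided",
    message="denying sql injections",
    severity=10,
    vendor_options=["no-backslash-escapes"],
    inputs=["database", "deserialization"],
    injection=["failed-attempt", "successful-attempt"],
)
print(render_armr(form))
```

Since the colons, quotes, and closing keywords come from the template, the class of error mentioned above (a forgotten colon) cannot be introduced by the person filling in the form.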

 

The Look and Feel

The MC 2.x had a dashboard page with a number of problems:

(1) The Welcome section was always there. Even if you completed everything. The main user for the dashboard would be a SOC analyst. Not only do they not care about the welcome screen, those who might (Application Security Administrators) don’t want to see it always, if ever.

(2) This was essentially a mirror of the left-hand navigation. It is redundant and there is no need for it on the main dashboard page.

(3) The security events section, which is the main attraction here and the main go-to for the SOC Analyst, is tiny and easily missed in the crowded UI.

(4) The “Recent Activity” table. It’s integrated into the page and doesn’t really look like it is a table representing the data in the Security Events section, plus other data not needed by the SOC Analyst persona. The SOC Analyst doesn’t care if an Application was created or updated, if a rule was created, or whether an Instance was acknowledged in the system (in fact, instance acknowledgement was so unnecessary that it was removed from the product as a whole in MC 5.x). The 4 colored boxes contain more information that the SOC Analyst doesn’t care about. It’s just a LOT of noise, and very little actionable data on this page. The table doesn’t even have column headers, nor a way to sort the data!

This, obviously, had to change:

Here we see 2 versions of the dashboard. The one on the left was implemented around version 5.0 (actually just before). The 2nd example is the prototype currently under development.

Notice that with the dashboard on the left, all of the fluff and noise has been eliminated. The graph and table underneath ONLY show security events, and nothing else. This is tailored directly to the SOC Analyst, who is the persona that actually uses the main dashboard. When the user logs in they can immediately see that there have been 40 security violations, all across a single application on a single machine (only 1 row in the table, as opposed to several for violations across several agents). They immediately know what’s happening, to whom, when it happened, and if it is still happening.
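The filtering and grouping behind that table can be sketched as below. The `Event` shape and its `kind` values are assumptions for illustration, not the MC's event schema; the point is only that administrative events are dropped and the remaining violations collapse into one row per application and machine.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str         # "security" or "admin" (hypothetical categories)
    application: str
    host: str


def dashboard_rows(events: list[Event]) -> list[tuple[str, str, int]]:
    """Keep security events only and count them per (application, host)."""
    counts = Counter(
        (event.application, event.host)
        for event in events
        if event.kind == "security"
    )
    # most_common() orders rows by descending count, so the noisiest target is first.
    return [(app, host, total) for (app, host), total in counts.most_common()]


events = [Event("security", "orders", "host-1")] * 40
events += [Event("admin", "orders", "host-1"), Event("admin", "billing", "host-2")]
rows = dashboard_rows(events)
```

Here the 40 violations against one application on one machine produce a single row, and the two administrative events never reach the table.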

The next generation dashboard follows a slightly different design principle, and is a true dashboard. It was designed with the OODA Loop in mind. The purpose of a dashboard should be to allow the user to see, at a glance (we’re talking seconds), an overall picture of security across their protected applications. The SOC Analyst needs to be able to see if there are any violations going on, if the attack is targeted or just someone trying anything they can, how severe the attempts are, if they succeeded, and more, in order to determine which events are the highest priority and need immediate attention vs those that can wait. From HERE the Analyst can then take a deep dive into a particular incident or incidents, or pay attention to a specific host, and react in order to mitigate the threats.

This fills in the critical ORIENT aspect of the OODA Loop that I described as completely missing previously.

I can look at the first graph and tell if the attacks are targeting a specific machine or machines (in other words, the attacker is going after something specific) or if they are distributed across various applications, a sign that the attacker may just be throwing a lot of stuff out there to see what sticks. Those are 2 very different types of attacks and require different actions on the part of the SOC Analyst and the Application Security Administrator. The next graph breaks down how it is being attacked the most: which exploits we are seeing, how often, and how many are critical vs medium severity.

All of this is used to guide the SOC Analyst along in their decision-making process. We’ve taken care of the Observe and Orient parts of the OODA Loop, and armed with this the users will need to decide how to act.

 

Lastly, the UI look and feel needed updating. Everything was amateur-looking and in drawers, as noted before. But even converting them to pages didn’t help much. They needed an overhaul. Here are 2 examples of pages from 2.x (top row), and 2 examples of how they are changing under 5.x (bottom row).

The other thing that the application needed was consistency in design. To that end I created a style guide for the UI overhaul, and it went through several iterations. Rather than detail it out, you can see it here. Previously, there was no style guide for the old MC UI.

Attached here is a PDF of an iteration of the style guide.