Apache Knox
Perimeter Security

Perimeter security refers to the natural or constructed barriers that keep intruders out of, or captives within, the boundaries of our solutions.
It assists in protecting the cluster resources and provides a single access point for all REST and HTTP interactions, simplifying client interaction.
Among its main benefits, we can mention:
- "Hides" specific URLs/ports, acting as a proxy.
- Simplifies authentication of multiple services and UIs.
- Enables SSL termination at the perimeter.
- Facilitates the management of endpoints.
- Provides detailed access logs.
Typical perimeter security includes technologies like proxies, firewalls, intrusion detection systems (IDS), and virtual private network (VPN) servers. A combination of all these provides reinforced security.

Knox Features
Apache Knox acts as a kind of application proxy, within the perimeter layer, where it receives requests intended for another server and acts on behalf of the client to obtain the requested resource.
It was created to extend and simplify the reach of Apache Hadoop services to users outside the cluster without compromising the security of the ecosystem. It is designed as a reverse proxy: a server that sits in front of one or more web servers and intercepts client requests in order to improve security, reliability, and performance.
Apache Knox integrates with identity management and SSO (single sign-on) systems and allows the identity of these systems to be used for access to Hadoop clusters.
Knox delivers three groups of user-oriented services:
- Proxy Services: Through HTTP resource proxying.
- Authentication Services: Authenticating access to the REST API, as well as WebSSO flow for UIs. LDAP/AD, Header-based PreAuth, Kerberos, SAML, and OAuth are options.
- Client Services: Using scripts via DSL or Knox Shell classes directly as SDK. The interactive script environment KnoxShell combines the interactive Groovy Shell with the Knox Shell SDK Classes for interaction with the data of the deployed Hadoop cluster.
Knox Benefits
The main advantages of the Knox Gateway are:
- Simplified Access: Knox extends the REST/HTTP services of Hadoop by encapsulating Kerberos within the cluster.
- Increased Security: Exposes the REST/HTTP services of Hadoop without revealing network details, providing SSL/TLS out-of-the-box.
Knox delegates external client requests to the corresponding Hadoop services and, before delegating, applies all the security services configured for the cluster. Further advantages include:
- Centralized Control: Enforces REST API security "centrally," routing requests to multiple Hadoop clusters.
- Corporate Integration: Supports LDAP, Active Directory, SSO, SAML, and other authentication systems.
- Demonstration LDAP: Available by default in Apache Knox.
- Audit Logging.
Apache Knox Architecture
Apache Knox acts as a single point of contact for Apache Hadoop services in the cluster.
It runs as a cluster of servers in the DMZ (demilitarized zone: a segment situated between a trusted network and an untrusted network, providing physical isolation between the two), isolating the Hadoop cluster from the rest of the corporate network.
Its main feature is the provision of a security perimeter for Hadoop REST APIs, restricting the number of network endpoints required to access the Hadoop cluster.
As a result, it "hides" the topology of the Hadoop cluster. At the network perimeter, it provides a single point of authentication and token verification.
Knox can be used with both Kerberized (protected by Kerberos) and non-Kerberized clusters.
In an enterprise solution employing Kerberos-protected clusters, it integrates well with enterprise identity management solutions and, as mentioned earlier, hides the implementation details of the Hadoop cluster and reduces the number of services with which a client needs to interact.
The deployment architecture of the Knox Gateway can be understood in the following diagram:

How Apache Knox Works
The Apache Knox Gateway is an application gateway for interacting with the REST APIs and UIs of Apache Hadoop deployments.
It provides a single point of access for REST and HTTP interactions with Apache Hadoop clusters. To this end, it offers three groups of user-oriented services:
- Proxy Services: Provides access to Apache Hadoop through HTTP resource proxying.
- Authentication Services: Authentication for REST API access, as well as WebSSO flow for UIs. LDAP/AD, Header-based PreAuth, Kerberos, SAML, and OAuth are all available options.
- DSL/SDK Client Services: Client development can be done by scripting through the DSL or by using the Knox Shell classes directly as an SDK. The interactive scripting environment KnoxShell combines the interactive Groovy shell with the Knox Shell SDK classes for interacting with data from the deployed Hadoop cluster.
The Knox Gateway was designed as a reverse proxy with pluggability in mind, both in policy enforcement (through providers) and in the back-end services to which it proxies requests.
Policy enforcement covers authentication/federation rules, authorization, auditing, dispatch, host mapping, and content rewriting. Policy is enforced through a chain of providers that are defined in the topology deployment descriptor for each Hadoop cluster controlled by Knox. The cluster definition is also made in the deployment descriptor and provides the Knox Gateway with the layout of the cluster for the purpose of routing and converting between user-facing and internal cluster URLs.
Each Hadoop cluster protected by Knox has its set of REST APIs represented by a single, cluster-specific application context path. This allows Knox to protect multiple clusters while presenting a single endpoint to the REST API consumer.
When a deployment descriptor is written to the topologies directory of the Knox installation, the new Hadoop cluster definition is processed, the policy enforcement providers are configured, and the application context path is made available to REST API consumers.
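To make the relationship between the provider chain, the cluster definition, and the URL mapping more concrete, here is a minimal Python sketch. It assumes a topology descriptor shaped like the usual Knox layout (a `gateway` element listing providers, plus `service` entries with a role and an internal URL). The host name, the topology name `sandbox`, and the helper functions `provider_chain` and `to_internal_url` are hypothetical, and the prefix-based mapping is a simplification: Knox itself applies per-service rewrite rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor; in a real installation it would be a file in the
# topologies directory, and the file name would give the topology name.
TOPOLOGY_XML = """
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
    </provider>
    <provider>
      <role>identity-assertion</role>
      <name>Default</name>
      <enabled>true</enabled>
    </provider>
    <provider>
      <role>authorization</role>
      <name>AclsAuthz</name>
      <enabled>true</enabled>
    </provider>
  </gateway>
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode.internal.example:50070/webhdfs</url>
  </service>
</topology>
"""


def provider_chain(topology_xml):
    """Return the enabled providers as (role, name) pairs, in document order."""
    root = ET.fromstring(topology_xml)
    return [
        (p.findtext("role"), p.findtext("name"))
        for p in root.findall("./gateway/provider")
        if p.findtext("enabled", "true").strip().lower() == "true"
    ]


def to_internal_url(topology_xml, topology_name, request_path, gateway_path="gateway"):
    """Map /{gateway_path}/{topology}/{service}/... to the service's internal URL.

    Simplified illustration only: returns None when the path does not belong
    to this topology or names a service the topology does not define.
    """
    prefix = f"/{gateway_path}/{topology_name}/"
    if not request_path.startswith(prefix):
        return None
    service_segment, _, rest = request_path[len(prefix):].partition("/")
    root = ET.fromstring(topology_xml)
    for service in root.findall("./service"):
        if service.findtext("role", "").strip().lower() == service_segment.lower():
            base = service.findtext("url", "").strip().rstrip("/")
            return f"{base}/{rest}" if rest else base
    return None


print(provider_chain(TOPOLOGY_XML))
print(to_internal_url(TOPOLOGY_XML, "sandbox", "/gateway/sandbox/webhdfs/v1/tmp"))
```

The point of the sketch is that the consumer only ever sees the `/gateway/sandbox/...` path, while the internal host and port stay inside the descriptor.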
Knox also complements Kerberos protection of clusters well, providing appropriate network isolation and:
- integrating well with enterprise identity management solutions.
- protecting the deployment details of the cluster (hosts, ports).
- simplifying the number of services with which clients need to interact.
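From the client's point of view, the single access point described above means that every call targets the gateway rather than an individual Hadoop host. The following Python sketch only builds such a request and does not send it. The gateway address `knox.example.com:8443`, the topology name `sandbox`, the credentials, and the helper `build_knox_request` are all assumptions for illustration, and BASIC authentication is just one of the options a topology may be configured with.

```python
import base64
import urllib.request


def build_knox_request(gateway_base, topology, service_path, username, password):
    """Build (but do not send) a GET request addressed to the gateway.

    The client never names an internal host or port; it only knows the
    gateway base URL, the topology, and the service path.
    """
    url = f"{gateway_base.rstrip('/')}/{topology}/{service_path.lstrip('/')}"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical values throughout.
request = build_knox_request(
    "https://knox.example.com:8443/gateway",
    "sandbox",
    "webhdfs/v1/tmp?op=LISTSTATUS",
    "guest",
    "guest-password",
)
print(request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, with an SSL context that trusts the gateway's certificate.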

Apache Knox Resources
- Configuration for New Services and UIs: Apache Knox provides a targeted method of configuration to add new routing services. This enables new Hadoop REST APIs to be quickly and easily incorporated. This functionality was added in release 0.6.0.
- Homepage: Apache Knox provides a HomePage that can be used as a "front-door" to implementations and resources that are published for access through it. It is a good alternative for distributing a link to the administrative interface and obtaining Quick Links.
- Authentication: The authentication function providers are responsible for collecting the credentials presented by the API consumer, validating them, and communicating the success or failure of the authentication to the client or the rest of the provider chain. Knox comes with the Shiro authentication provider, which leverages the Apache Shiro project to authenticate BASIC credentials against an LDAP user repository. Support for OpenLDAP, Apache DS, and Microsoft Active Directory is available.
- Federation/SSO: For clients that require credentials to be presented only to a limited set of trusted entities within the enterprise, Knox can be configured to federate the authenticated identity from an external authentication event. This is done through providers with the federation function. The set of ready-to-use federation providers includes:
  - Standard KnoxSSO Form-Based IDP: The default KnoxSSO configuration provides a form-based authentication mechanism that leverages Shiro authentication to authenticate against LDAP/AD with credentials collected from a form-based challenge.
  - PAC4J: The pac4j provider adds authentication and federation capabilities such as SAML, CAS, OpenID Connect, Google, Twitter, etc.
  - HeaderPreAuth: A mechanism for propagating identity through HTTP headers that specify the username and group of the authenticated user. This has been used for use cases like SiteMinder and IBM Tivoli Access Manager.
- Knox SSO: The Knox SSO service is an integration service that provides a normalized SSO token to represent the authenticated user. This token is generally used for WebSSO resources for participating UIs and their consumption of Hadoop REST APIs. Knox SSO abstracts the actual identity provider integration from the participating applications so that they need only be "aware" of the KnoxSSO cookie. The token is presented by the browser as a cookie, and applications participating in the KnoxSSO integration can cryptographically validate the presented token and remain independent of the underlying SSO integration.
- Authorization: The authorization function is used by providers that make access decisions for requested resources based on the effective user identity context. This identity context is determined by the authentication provider and the identity assertion provider mapping rules. The evaluation of the user identity context and group principals against a set of access policies is done by the authorization provider to determine whether access should be granted to the user for the requested resource. Knox comes with an ACL-based authorization provider that evaluates rules that include username, groups, and IP addresses. These ACLs are bound to and protect resources at the service level. That is, they protect access to the Hadoop services themselves, based on the user, group, and remote IP address.
- Auditing: Auditing makes it possible to determine which actions were performed by whom over a specific period. The audit facility is built on an extension of the Log4j framework and can be extended by replacing the "out-of-the-box" implementation with another.
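The service-level ACL evaluation described under Authorization can be sketched as follows. This is a simplified model, not Knox's implementation: it assumes a three-part rule of the form `users;groups;addresses`, comma-separated values with `*` as a wildcard, and that all three parts must match for access to be granted. The names `Acl`, `parse_acl`, and `is_allowed`, and the sample users and addresses, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    users: frozenset
    groups: frozenset
    addresses: frozenset


def parse_acl(value):
    """Parse 'users;groups;addresses' into an Acl (comma-separated entries)."""
    parts = [
        frozenset(item.strip() for item in part.split(",") if item.strip())
        for part in value.split(";")
    ]
    if len(parts) != 3:
        raise ValueError("expected exactly three ';'-separated parts")
    return Acl(*parts)


def _matches(allowed, candidates):
    # '*' accepts anything; otherwise at least one candidate must be listed.
    return "*" in allowed or bool(allowed & set(candidates))


def is_allowed(acl, user, groups, remote_ip):
    """Grant access only if user, group, and remote address all match."""
    return (
        _matches(acl.users, [user])
        and _matches(acl.groups, groups)
        and _matches(acl.addresses, [remote_ip])
    )


# Hypothetical rule protecting one service.
webhdfs_acl = parse_acl("hdfs;admin;127.0.0.2,127.0.0.3")
print(is_allowed(webhdfs_acl, "hdfs", ["admin"], "127.0.0.2"))
print(is_allowed(webhdfs_acl, "guest", ["admin"], "127.0.0.2"))
```

In this model an empty part matches nothing, so a rule such as `hdfs;;*` would deny every request. Whether that is the desired behavior is a design choice of the sketch.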
Recommendations and Best Practices
- Integration of Knox with Apache Ranger is recommended to check the permissions of users who wish to access the cluster resources.
- Enabling SSL is highly recommended.
- If Hive connections made through Apache Knox are being dropped, it is recommended to adjust the connection timeout and the maximum number of connections in the gateway site configuration file.
- To improve response times, set the following properties in the gateway-site configuration:
  - gateway.metrics.enabled=false
  - gateway.jmx.metrics.reporting.enabled=false
  - gateway.graphite.metrics.reporting.enabled=false
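As an illustration of how the three metrics properties above might be written out, the following Python sketch renders them as Hadoop-style `property` entries. It assumes the gateway site file uses the usual `configuration`/`property`/`name`/`value` layout. The helper name `to_configuration_xml` is hypothetical, and `ET.indent` requires Python 3.9 or later. In a managed cluster the same values would more likely be set through the management UI than by hand.

```python
import xml.etree.ElementTree as ET

# The three properties named in the recommendation above.
METRICS_OFF = {
    "gateway.metrics.enabled": "false",
    "gateway.jmx.metrics.reporting.enabled": "false",
    "gateway.graphite.metrics.reporting.enabled": "false",
}


def to_configuration_xml(properties):
    """Render name/value pairs as a <configuration> block of <property> entries."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


print(to_configuration_xml(METRICS_OFF))
```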
Supported Services by Knox
The following services of the Hadoop ecosystem integrate with the Knox Gateway. The full list can be consulted on the community page, under "Supported Apache Hadoop Services" and "Supported Apache Hadoop ecosystem UIs".
Supported Components by Knox
Component | SSO | Proxy (API) | Proxy (UI) |
---|---|---|---|
Ambari | YES | YES | YES |
Ambari Metrics/Grafana | | | |
Atlas | YES | YES | YES |
HBase | YES | | |
HDFS | YES | | |
Hive (via JDBC) | YES | | |
Hive (via WebHCat) | YES | | |
MapReduce2 | YES | YES | |
Zookeeper | YES | YES | YES |
Spark2/Spark History Server | YES | YES | |
WebHCat | YES | | |
WebHDFS | YES | | |
YARN | YES | YES | YES |
Zeppelin | YES | YES | YES |
TDP v2 - Knox - Services Supported by Knox with Proxy and SSO (Kerberized or non-Kerberized Clusters)
Apache Knox Project Details
Apache Knox was developed in Java.
