Sun Microsystems
Systems Architecture Committee
_________________________________________________________________

Subject:       Virtualization and Namespace Isolation in the
               Solaris Operating System
Submitted by:  Andrew T.
File:          PSARC/2002/174/opinion.ms
Date:          July 30th, 2003
Committee:     Bill Sommerfeld, Ralph C., James Carlson,
               Richard McDougall, Terrence Miller, Andrew T.
Abstain:       Glenn Skinner, Shudong Zhou.
Product Approval Committee: Solaris PAC

1. Summary

This project introduces Solaris zones. Zones provide a means of
virtualizing operating system services, allowing one or more
processes to run in isolation from other activity on the system.
This isolation prevents processes running within a given zone from
monitoring or affecting processes running in other zones. Each zone
is a sandbox within which one or more applications can run without
affecting or interacting with the rest of the system, and each zone
can be administered separately.

All zones on a given system share the same Solaris kernel. Unlike
physical or virtual machine partitioning, this sharing allows for
fine-grained partitioning of resources, at the cost of requiring
identical versions of the Solaris software to be installed in all
zones on a system. Because the kernel is shared, hardware and
software faults fatal to the kernel will necessarily affect all
zones.

2. Decision & Precedence Information

The project is approved as specified in reference [1].

The project may be delivered in a minor release of the Solaris
Operating System.

3. Interfaces

The project exports the following interfaces.
________________________________________________________________________
|                         Interfaces Exported                          |
|_____________________|_________________|______________________________|
| Interface           | Classification  | Comments                     |
|_____________________|_________________|______________________________|
| zlogin(1)           | Evolving        | Utility to enter a zone      |
| zonename(1)         | Evolving        | Utility to get current zone  |
|                     |                 | name                         |
| zoneadm(1M)         | Evolving        | Utility to administer zones  |
| zoneadmd(1M)        | Project Private | Daemon which administers     |
|                     |                 | zones                        |
| zonecfg(1M)         | Evolving        | Utility to configure zones   |
| zone_create(2)      | Project Private | System calls to manipulate   |
| zone_destroy(2)     |                 | and get information about    |
| zone_enter(2)       |                 | zones                        |
| zone_getattr(2)     |                 |                              |
| zone_list(2)        |                 |                              |
| zone_lookup(2)      |                 |                              |
| zone_shutdown(2)    |                 |                              |
| getzoneid(3C)       | Evolving        | Functions to map between     |
| getzoneidbyname(3C) |                 | zone id and name             |
| getzonenamebyid(3C) |                 |                              |
| ucred_getzoneid(3C) | Evolving        | Function to get zone id from |
|                     |                 | credential                   |
| crgetzoneid(9F)     | Evolving        | Kernel interface to access   |
|                     |                 | zone id                      |
| zcmn_err(9F)        | Evolving        | Kernel interface to print to |
|                     |                 | zone console                 |
| ipcrm(1)            | Evolving        | Added -z option              |
| ipcs(1)             | Evolving        | Added -z and -Z options      |
| pgrep(1)            | Evolving        | Added -z option              |
| pkill(1)            | Evolving        | Added -z option              |
| ppriv(1)            | Evolving        | Added -z option              |
| priocntl(1)         | Evolving        | Added -i zoneid option       |
| ps(1)               | Evolving        | Added -o zone and -o zoneid  |
|                     |                 | options                      |
| renice(1)           | Evolving        | Added -i zoneid option       |
| coreadm(1M)         | Stable          | Added %z token               |
| ifconfig(1M)        | Evolving        | Added -Z, zone, and -zone    |
|                     |                 | options                      |
| poolbind(1M)        | Evolving        | Added -i zoneid option       |
| prstat(1M)          | Evolving        | Added -z and -Z options      |
| mount_proc(1M)      | Evolving        | Added -o zone option         |
| priocntl(2)         | Evolving        | Added P_ZONEID id type       |
| getpriority(3C)     | Evolving        | Added PRIO_ZONE to possible  |
| setpriority(3C)     |                 | "which" values               |
| priv_str_to_set(3C) | Evolving        | Added zone token to refer to |
|                     |                 | zone privilege set           |
| core(4)             | Stable          | Added NT_ZONENAME note type  |
| proc(4)             | Stable          | Added zone id to pstatus_t   |
|                     |                 | and psinfo_t                 |
| privileges(5)       | Evolving        | Added new privileges         |
|                     |                 | PRIV_PROC_ZONE,              |
|                     |                 | PRIV_SYS_ADMIN, and          |
|                     |                 | PRIV_NET_ICMPACCESS          |
| if_tcp(7P)          | Evolving        | Added SIOCGLIFZONE and       |
|                     |                 | SIOCSLIFZONE; added          |
|                     |                 | LIFC_ALLZONES for            |
|                     |                 | SIOCGLIFCONF and SIOCGLIFNUM |
| if_tcp(7P)          | Consolidation   | Added SIOCREMZONEIFS         |
|                     | Private         |                              |
| TCP_IOC_ABORT_CONN  | Contracted      | PSARC 2001/292 (add          |
|                     | Consolidation   | ac_zoneid field).            |
|                     | Private         |                              |
| libzonecfg.so.1     | Project Private | Zone configuration access    |
|                     |                 | library                      |
|_____________________|_________________|______________________________|

4. Opinion

4.1. Package Dependencies

The "sparse root" model of install allows multiple versions of the
same package to be installed in different zones; however, because
all packages need to run over the same kernel, inter-package
dependencies may cause certain combinations to be uninstallable or
otherwise unworkable when the zones exist on the same system. This
is similar to existing problems with multiple versions of the Java
runtime; we will need to set customer expectations appropriately.
This resolution is not particularly satisfying, but it is not clear
what can be done given the constraints.

4.2. Managing complex configuration data structures.

The committee is seeing an increasing number of projects that
require complex structured configuration data, and has not seen
particular unity in their approaches to the problem. Several
members felt that it would be appropriate for a system management
SWG to look at this issue. While there is a project looking into a
GUI for zone configurations and pools, the problem here is likely
more complex than just the user interface.

In addition, security evaluations/certifications have been observed
to go more smoothly when tools more specialized than text editors
are used for management, particularly from an auditing perspective.

4.3. Subtle semantic differences between global and local zones

There was a concern that the implementation may contain subtle
semantic differences between global and local zones that will take
time to discover and understand. Several members of the committee
feel that this particular design exposes significant complexities
of the implementation in customer-visible ways, and that system
administrators will need to understand much of the underlying
technology in order to predict what will happen to an application
moved inside a local zone. The technology is clearly needed, but
customers may have great difficulty adopting this particular
design.
These concerns caused one member to abstain from voting.

4.4. Completeness

With a project of this complexity it is not clear how to tell that
it is complete, and the number of known follow-on projects is a
matter of some concern. The project team admittedly made some
tradeoffs between schedule and completeness, but it believes that
the current feature set is sufficiently complete to be useful to
customers; the committee accepted this argument.

4.5. Device node visibility

There was substantial discussion regarding the visibility of device
nodes to local zones and the corresponding exposure of device
drivers to zone-related issues. Indirect use of drivers by
zone-aware subsystems (networking, filesystems) is not problematic,
as those subsystems insulate the driver from zone-boundary issues;
this section concerns direct use of arbitrary device nodes/drivers
by programs running in a zone. Several sub-issues were discussed:

4.5.1. Driver qualification for zones.

The project team intends that direct use of drivers by local zones
should generally be discouraged. Such access tends to introduce
subtle security and fault containment issues that defeat some of
the protections provided by the zones facility. However, there are
many pseudo devices included as part of the Solaris programming
API, such as /dev/null, /dev/zero, etc. Their drivers have been
examined by the project team and deemed safe. Other drivers should
be considered local-zone unsafe until proven otherwise.

In addition, customers have requested the ability to make
special-case "I need this device in a zone" exceptions despite the
security issues.

As part of this project, the project team intends to provide
internal and external documentation/guidelines for direct use of
drivers by local zones.

4.5.2. Automatic device node export.

The zones design provides for a rule-based mechanism to
automatically export device nodes into local zones; rules can match
based on the /dev pathname, the driver name, or a combination of
driver name and minor name. This mechanism is primarily intended
for dynamic creation of pseudo terminals, but it can also be used
for hotplugged devices.

As the I/O group is attempting to move away from administrative
interfaces based on driver names and major/minor pairs, this
interface remains controversial, but the majority did not object to
its inclusion in the project. One member feels that the case should
not have exposed interfaces based on driver and minor names, and
voted to abstain as a result.

4.5.3. Implementation

The current zones implementation modifies devfsadm(1M) and
devfsadmd(1M) to handle /dev entries in all zones. The I/O group is
concerned with the complexity of devfsadm and would prefer a
layered solution, with local-zone device nodes managed by a
separate utility, using devfs events to handle dynamic devices.
This is another case where a fully dynamic /dev filesystem
(sometimes referred to as "/devfs") would have helped, by avoiding
the need to modify devfsadm.

4.5.4. Failures of administrative commands.

Administrative commands that may fail when run in a non-global zone
should print error messages that clearly describe the reason for
failure (which might include device availability, lack of
sufficient privileges, or the fact that the operation is simply not
allowed in a non-global zone). It should then be possible to deduce
from documentation that the failure resulted from limitations
imposed on the zone.

5. Minority Opinion(s)

None.

6. Advisory Information

6.1. Local/remote transparency for inter-zone traffic.

The PAC should fund a project to add support for packet
filtering/policy enforcement for looped-back network traffic
between zones sharing the same system.

6.2. Certification/qualification of third-party applications

There needs to be an approach for supporting third-party
applications (especially applications with a kernel component, such
as file systems) in the presence of zones.

6.3. Management of complex structured data.

We believe a system-management SWG should look into the general
problem of managing complex structured configuration data. See
section 4.2 above.

6.4. Dynamic /dev filesystem.

The PAC should fund a project to add support for a dynamic "/dev"
filesystem; see sections 4.5.2 and 4.5.3 above.

7. Appendices

7.1. Appendix A: Technical Changes Required

None.

7.2. Appendix B: Technical Changes Advised

None.

7.3. Appendix C: Reference Material

Unless stated otherwise, path names are relative to the case
directory PSARC/2002/174.

1. Virtualization and Namespace Isolation in the Solaris Operating
   System (design document)
   File: zone-design.pdf