NIST Common Data Format for Digital Voting Systems

From Trust The Vote


Members of the TrustTheVote CoreTeam will attend a NIST workshop [1] on "Common Data Format for Electronic Voting Systems," to be held October 29-30, 2009, at the National Institute of Standards and Technology (NIST) in Gaithersburg, MD.

Related to this, we recently announced [2] a joint project with the Overseas Vote Foundation to develop a proposed draft standard for the digital exchange of voter registration data. The growing movement toward standards in the representation, interchange, and transaction of elections and voting data provides an opportunity for the OSDV Foundation and TrustTheVote Project to advance our positions and proposals on this subject.

There is a pressing need to identify and agree upon a set of requirements for a common data format for voting systems. To date, there is little consensus on the requirements for this format or what it is to accomplish.

We're interested in the issues of interoperability of different equipment, "auditability" (sic), transparency, publishing (e.g., communication with consumers of election data, including the public and media), integration between polls and registration, applications in digital record-keeping, and ensuring transparency in process (compilation, analysis, and availability).

A summary of some of our positions can be found in the 5-page Position Paper that we submitted to NIST for inclusion in the Workshop. Further information is provided below on some of the specific topics that NIST requested comment on, in the CFP for the workshop.


Selective but Broad Scope: Complete Election Lifecycle

For the scope of activities towards a Common Data Format (CDF), we favor a broad scope that includes the full range of IT-supported activities in the lifecycle of an election. However, we also believe that in the near term, the most fruitful work would be on developing Data Format Definitions (DFDs) for use at specific points in a system architecture. Those points would be the interfaces between major components of a broad election IT system: a digital voter registration system (DVRS), an Election Management System (EMS), a ballot design component, devices for ballot casting and counting, facilities for configuring and managing such devices, a tabulation component, and reporting components. Our TTV architecture is described more fully in our TrustTheVote Technology White Paper. Also, our TTV System Block Diagram shows the dataflows at the interfaces between these components.

Broad Scope for Logging, Externalization, Publication, Transparency

We have a strong focus on building operational transparency into all elections and voting systems applications. As a result, we place equal emphasis on data format definitions for both log data and operational data. We do not characterize event log data as low level and useful only for auditing. Rather, we view comprehensive logging as a central function of every election IT system and component. Likewise, externalization of log data in CDF should be a central public benefit of every election IT system and component, enabling publication of logs, and creating operational transparency and public accountability.

However, in this area, we don't view DFD development as an activity with near term value. Rather than defining the format for log data of election IT systems, we advocate a near term focus on content -- for each system and component, what are the events to be logged, and what event-specific log data must be captured? There is considerable work to be done on answering these questions in detail. With those details, the process of defining DFDs should be no different from other, non-election-specific activities for defining event log structures and data formats.

Limited Scope for Auditability

Despite the focus on logging, we don't view audit support as a critical near term area of work toward election CDF. We agree that DFDs, especially including those for log data, are important data artifacts for election audits. We understand the concern that the ability to do effective audits rests in part on the DFDs being human readable and comprehensible by auditors. However, we don't share the assumption that the primary mode of auditor access to this data is via directly examining the literal form of log data that has been exported in a common interoperable data format. As a result, we don't think that audit-ability testing of DFDs is a key short-term goal.

Instead, our approach is to view audit support features not as a requirement for data representation, but as a requirement for software development activities that can re-use, extend, or define data formats as needed for audit support features. Starting with a specific application's functionality (e.g., a central ballot counting device's function of accepting human input to override erroneous overvote interpretations of ballot images), and the log data recorded (about the operator, the ballot, and the changes to that ballot's votes), we can consider the requirements for this body of data to be effectively used in an audit setting. Hence, the audit-support component of our system architecture will be driven by these requirements, and usability testing with auditors will be of the audit-support component, rather than testing of the readability of the plain text of any dataset.

In this approach, the role of CDF is to be able to externalize log data, and other audit-relevant data, exported in a common interoperable data format, for interoperability with other reporting or auditing systems.
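As an illustration only, the overvote-override example above can be sketched in a few lines of Python. The record type `OverrideEvent`, its field names, and the use of JSON as a stand-in for a common interchange format are all assumptions made for this sketch; none of them is taken from TrustTheVote software or from any existing standard. The point is that the auditor-facing function works on the exported data, so nobody has to read the literal text of the export.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class OverrideEvent:
    """Hypothetical log record for a manual override of an overvote."""
    timestamp: str            # ISO 8601 time of the operator action
    operator_id: str          # who made the change
    ballot_id: str            # which ballot image was affected
    contest: str              # contest whose interpretation changed
    votes_before: List[str]   # machine interpretation (apparent overvote)
    votes_after: List[str]    # interpretation after human review


def externalize(events):
    """Export log events as JSON (a stand-in for a common data format)."""
    return json.dumps([asdict(e) for e in events], indent=2, sort_keys=True)


def overrides_by_operator(exported):
    """A small audit-support query run against the exported data."""
    counts = {}
    for record in json.loads(exported):
        operator = record["operator_id"]
        counts[operator] = counts.get(operator, 0) + 1
    return counts


events = [
    OverrideEvent("2009-11-03T20:15:00Z", "op-17", "ballot-0042",
                  "Mayor", ["A", "B"], ["A"]),
    OverrideEvent("2009-11-03T20:21:30Z", "op-17", "ballot-0108",
                  "Measure 5", ["Yes", "No"], ["No"]),
    OverrideEvent("2009-11-03T20:40:10Z", "op-03", "ballot-0311",
                  "Mayor", ["B", "C"], ["C"]),
]

exported = externalize(events)
print(overrides_by_operator(exported))
```

In this sketch, usability testing with auditors would target functions like `overrides_by_operator`, not the readability of the `exported` string.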

Interoperability Testing Rather Than Conformance Testing

We believe that in many cases, interoperability is a more useful short term goal than conformance. In the metaphor of crawl-walk-run, conformance testing is most appropriate when walking has been occurring for some time, i.e., for a data format definition that is quite mature, such as election results. Experience from existing pilots with election result reporting in a common data format (e.g. California's use of EML 6.0 section 530 for election result reporting) enables the creation of public benefit through conformance testing: if multiple states report results with the same data format, and consumers of the information obtain results data from multiple states, then conformance testing can ensure that the states do in fact use the same format, so that consumers can correctly merge data from multiple states. With less mature DFDs that may be still evolving, demonstration of interoperability both has intrinsic value and also helps evolution towards a maturity in which conformance testing is beneficial.
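A minimal sketch of why conformance matters to a consumer who merges results from several states is given here. The field names in `REQUIRED_FIELDS` and the sample feeds are hypothetical and are not drawn from EML or from any real state's reporting format.

```python
# Hypothetical field set that every reporting state is expected to use.
REQUIRED_FIELDS = {"state", "contest", "candidate", "votes"}


def conforms(record):
    """Check one result record against the agreed (hypothetical) format."""
    return (
        set(record) == REQUIRED_FIELDS
        and isinstance(record["votes"], int)
        and record["votes"] >= 0
    )


def merge(feeds):
    """Sum votes per (contest, candidate) across feeds, refusing bad records."""
    totals = {}
    for feed in feeds:
        for record in feed:
            if not conforms(record):
                raise ValueError(f"non-conforming record: {record!r}")
            key = (record["contest"], record["candidate"])
            totals[key] = totals.get(key, 0) + record["votes"]
    return totals


state_one = [
    {"state": "CA", "contest": "President", "candidate": "A", "votes": 100},
    {"state": "CA", "contest": "President", "candidate": "B", "votes": 80},
]
state_two = [
    {"state": "OR", "contest": "President", "candidate": "A", "votes": 40},
]

merged = merge([state_one, state_two])
print(merged)
```

A feed that names the vote count field differently, or reports it as text, is rejected before it can silently distort the merged totals. That is the public benefit that conformance testing is meant to secure.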

Flexibility and Extensibility

This notion of less mature DFDs is an important one for us. We believe that the short-term focus of CDF work should be iterative extension of existing DFDs, developing new ones when required, in work that's based on specific use cases of data interchange at specific points in an architecture of election IT systems and components. As a result, we suggest that initial work on application-specific data formats be done with a goal of flexibility and extensibility of formats and early implementation of software that supports the formats.

Before a data format definition (DFD) reaches real maturity, there will likely be some goals that could be called "get the data out," meaning that real systems have useful data that isn't yet well represented in an evolving DFD. During that evolution, there will certainly be cases where an evolving DFD lacks the expressive power to represent some existing system's data. In these cases, human comprehension and system interoperability can be achieved by using the DFD "as far as it goes", and "getting the data out" using an extension method of the DFD itself. Perhaps the most-used example of this approach is SMTP's allowing for mail header fields for application-specific or vendor-specific extensions -- so-called "X-headers". In an election dataset instance using an extension mechanism, interoperation is supported by the use of "standard" fields, and "get the data out" is supported by the use of X-fields. A similar approach has proved useful in the context of voter registration requests, where some forms of addressing information are best represented by free form data rather than by using a complex structuring scheme.
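The X-field idea can be sketched as follows. The set `STANDARD_FIELDS`, the `X-` prefix convention applied to election records, and the sample record are assumptions made for illustration; no existing DFD is being quoted.

```python
# Hypothetical "standard" fields of an evolving DFD.
STANDARD_FIELDS = {"contest", "candidate", "votes"}
EXTENSION_PREFIX = "X-"


def split_record(record):
    """Separate standard fields from X-field extensions.

    Standard fields support interoperation. X-fields "get the data out"
    for consumers that understand them and can be ignored by others.
    A field that is neither standard nor an X-field is an error.
    """
    standard, extensions = {}, {}
    for name, value in record.items():
        if name in STANDARD_FIELDS:
            standard[name] = value
        elif name.startswith(EXTENSION_PREFIX):
            extensions[name] = value
        else:
            raise ValueError(f"unknown non-extension field: {name}")
    return standard, extensions


record = {
    "contest": "Mayor",
    "candidate": "A",
    "votes": 120,
    "X-ExampleVendor-ScannerBatch": "batch-7",  # vendor-specific data
}

standard, extensions = split_record(record)
print(standard)
print(extensions)
```

A consumer that knows only the standard fields can still process the record, and a later revision of the DFD could promote a widely used X-field to a standard field.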

Human Readability, Machine Readability, and Mark-up Languages

Readability is important, particularly for DFDs that are a work in progress. Human readability can accelerate participation in the work of extending an evolving DFD, while lack of readability can inhibit comprehension and reduce participation. The same is true of the capability for human manipulation of datasets that are represented by DFDs that are a work in progress.

We believe that the very important goal of human readability of DFDs (and datasets expressed using them) can be met without a trade-off with machine-readability. We believe that readability does not depend on the syntax of a DFD; the literal format of a dataset does not have to be the medium of "readability". It's true that some people in some situations will view DFDs and datasets by direct examination of literal representation of a DFD, or a dataset in that DFD, viewing as plain text a body of XML, YAML, CSV, etc. However, we don't assume that direct examination is the primary mode of access, such that one must pick a single mark-up language with the best readability properties.

For example, in some of our software development projects, we use the Ruby on Rails application framework, in which YAML is a built-in syntax for importing and exporting data. In some cases, the structure of a particular data record is very similar to a record type of EML; we expect to externalize such data in EML in situations that require interoperability. But we also use CSV as a convenient format for some application-specific configuration data that is periodically manipulated by people; a spreadsheet program seems a better tool (for seeing the structure of the data and doing bulk manipulation of some data) than text editing a YAML file. Each of XML, YAML, and CSV has its own uses for representing the same data formats, at least in our work. And as the use of CSV and a spreadsheet tool shows, the literal format of a dataset does not have to be the medium of "readability" -- and the same applies to XML, which can be viewed with a variety of tools including some spreadsheet applications. XML itself has many advantages in the context of interoperability between systems, or publishing of data. Certainly, we anticipate meeting feature demands for our software to externalize some types of data in XML, perhaps with schemas defined by EML, particularly if EML evolves to be more usable in meeting the needs of U.S. elections processes.
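To make the point concrete, the sketch below writes the same small dataset as CSV and as XML and reads it back from each. It uses only Python's standard library (so YAML is left out), and the element names `results` and `result` and the column names are invented for this example; they are not EML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical columns; all values are kept as text in both syntaxes.
FIELDS = ["precinct", "contest", "candidate", "votes"]
ROWS = [
    {"precinct": "P-01", "contest": "Mayor", "candidate": "A", "votes": "120"},
    {"precinct": "P-01", "contest": "Mayor", "candidate": "B", "votes": "95"},
]


def to_csv(rows):
    """Serialize rows as CSV, convenient for spreadsheet editing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Read rows back from CSV text."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def to_xml(rows):
    """Serialize the same rows as XML, convenient for interchange."""
    root = ET.Element("results")
    for row in rows:
        result = ET.SubElement(root, "result")
        for field in FIELDS:
            ET.SubElement(result, field).text = row[field]
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Read rows back from XML text."""
    return [{child.tag: child.text for child in result}
            for result in ET.fromstring(text)]


print(to_csv(ROWS))
print(to_xml(ROWS))
```

Both round trips recover the same rows, so the choice of syntax can follow the tool at hand (a spreadsheet for bulk editing, an XML toolchain for interchange) without changing the data format being represented.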

As a result of these experiences and observations, we believe that human readability can easily be accommodated in a variety of settings. Very likely a single syntax (likely XML) will be used to define U.S. standard DFDs for interoperability of election technology used in the U.S. But in the meantime, the emphasis of CDF work should be extending existing data format definitions, and developing new ones when needed, without a dogmatic insistence on use of any one syntax or markup language.

Factor Out Mechanisms for Data Provenance

Provenance of datasets is a very important issue. For audit purposes, auditors must be able to gain assurance that a particular dataset was the actual dataset created or used at a prior point in time of the election process that is being audited. Data authentication and integrity mechanisms are essential. However, we do not believe that early work on evolving DFDs needs to include data provenance. We prefer instead to work on iteratively creating and using DFDs to evolve them to a degree of completeness that meets the needs of real deployments. At the point where such a DFD is supported by software to be used in a real deployment, that software could and should use data provenance methods that are orthogonal to the data representations themselves, e.g. OASIS standards for digital signatures of XML datasets. We particularly wish to avoid having a collaborative process of DFD development become encumbered with specific (and potentially standardizable) structures and uses of public-key infrastructure (PKI), trust models, processes, and agents.
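One way to picture "orthogonal" provenance is a detached integrity tag computed over the exported bytes, whatever their syntax. The sketch below uses an HMAC with a shared key purely as a stand-in; a real deployment would more likely use public-key digital signatures, and the key, the function names, and the sample data here are all hypothetical.

```python
import hashlib
import hmac


def detached_tag(data: bytes, key: bytes) -> str:
    """Compute an integrity tag kept separately from the dataset itself."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check that the dataset is the one the tag was computed for."""
    return hmac.compare_digest(detached_tag(data, key), tag)


key = b"demo-key-not-for-real-use"  # placeholder secret for the sketch

# The same mechanism applies regardless of the dataset's syntax.
csv_export = b"contest,candidate,votes\nMayor,A,120\n"
xml_export = b"<results><result><votes>120</votes></result></results>"

csv_tag = detached_tag(csv_export, key)
xml_tag = detached_tag(xml_export, key)

print(verify(csv_export, key, csv_tag))
print(verify(csv_export.replace(b"120", b"121"), key, csv_tag))
```

Because the tag lives outside the dataset, the DFD needs no signature-related fields, and the provenance mechanism can be replaced later without revising the data format.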
