Summary
openpyxl started largely as the reverse-engineering of existing files. While this allowed the project to get up and running quickly, it also turned out to be a dead-end when it came to supporting more advanced features. This talk will explain how we've shifted to using descriptors, code generation and possibly the world's worst parser for more reliable code and faster development.
Description
Only after a great deal of public pressure did Microsoft agree to make the specification of Microsoft Office 2007 public and even this was after the fact. The result was many thousands of pages of narrative description and an extensive XML schema. Whatever one may think of the company or the format, the fact remains that for many offices Microsoft Excel is a standard format and we want to make it as easy as possible to work with the files with Python.
openpyxl started largely as the reverse-engineering of existing files. While this allowed the project to get up and running quickly, it also turned out to be a dead-end when it came to supporting more advanced features.
The Office Open XML Schema is, in effect, the domain-specific language for Microsoft Office. With tools like lxml it's even possible to work with this directly from Python but this requires extensive and detailed knowledge of something that is rarely intuitive and often downright confusing. Furthermore, manipulating XML in memory can be a problem on very large workbooks.
In openpyxl we introduced descriptors to deal with common types in the schema, and added class methods so that we could simplify parsing but keep the Python API as simple as possible. This started out as a fairly laborious transcription of schema to class definitions that has since been replaced by code generated by what might possibly be the world's worst parser – every library has to have some horrible code in it somewhere!
The result was that over 200 classes and their naive tests could be generated automatically to provide near complete support for the chart API. The process isn't perfect: code rarely runs straight out of the box but it has made it possible to support features that otherwise could only ever be partially supported at best.
The talk will cover what tools Python provides to make this kind of adventure succeed and where things could be improved.