Benchmarks are as important a measure of progress in AI as they are for the rest of the software industry. But very often when the benchmark results come from companies, secrecy prevents the community from verifying them.
For example, OpenAI has granted exclusive licensing rights to its powerful GPT-3 language model to Microsoft, with which it has a business relationship. Other organizations say the code they use to develop their systems depends on internal tools and infrastructure that cannot be made public, or relies on proprietary datasets. While the motivations can be ethical in nature – OpenAI initially withheld GPT-2, the predecessor of GPT-3, out of concern that it might be misused – the effect is the same. Without the necessary code, it is much harder for outside researchers to verify an organization’s claims.
“This isn’t really a sufficient alternative to industry open source best practices,” Columbia computer science Ph.D. candidate Gustaf Ahdritz told TechCrunch via email. Ahdritz is one of the lead developers of OpenFold, an open source version of AlphaFold 2, DeepMind’s protein structure prediction system. “It’s difficult to do all the science one would like to do with the code that DeepMind has released.”
Some researchers even go so far as to say that withholding a system’s code “undermines its scientific value.” In October 2020, a rebuttal published in the journal Nature took issue with a cancer prediction system trained by Google Health, the arm of Google focused on health-related research. The co-authors noted that Google withheld important technical details that could significantly affect the system’s performance, including a description of how it was developed.
In lieu of such change, some members of the AI community, like Ahdritz, have taken it upon themselves to open source these systems on their own. Working from technical papers, these researchers painstakingly try to recreate the systems, either from scratch or by building on fragments of publicly available specifications.
OpenFold is one such effort. Started shortly after DeepMind announced AlphaFold 2, its goal, Ahdritz says, is to verify that AlphaFold 2 can be reproduced from scratch and to make available components of the system that might be useful elsewhere.
“We trust that DeepMind has provided all the necessary details, but… we don’t have [concrete] evidence of that, and so this effort is key to providing that proof and allowing others to build on it,” said Ahdritz. “Also, certain AlphaFold components were originally released under a non-commercial license. Our components and data – DeepMind has still not released its full training data – will be fully open source to enable industry adoption.”
OpenFold isn’t the only project of its kind. Elsewhere, loosely knit groups within the AI community are attempting to implement OpenAI’s code-generating Codex and art-creating DALL-E, DeepMind’s chess-playing AlphaZero, and even AlphaStar, a DeepMind system developed to play the real-time strategy game StarCraft 2. Among the more successful are EleutherAI and AI startup Hugging Face’s BigScience, an open research effort aimed at providing the code and datasets necessary to run a model comparable (though not identical) to GPT-3.
Philip Wang, a prolific member of the AI community who maintains a number of open source implementations on GitHub, including one of OpenAI’s DALL-E, posits that open sourcing these systems reduces the need for researchers to duplicate their efforts.
“We read the latest AI studies like any other researcher in the world. But instead of replicating the paper in a silo, we implement it open source,” Wang said. “We are in an interesting place at the intersection of information science and industry. I think open source is not one-sided and benefits everyone in the end. It also appeals to the broader vision of a truly democratized AI that is not beholden to shareholders.”
Brian Lee and Andrew Jackson, two Google employees, collaborated on the creation of MiniGo, a replication of AlphaZero. While not affiliated with the official project, Lee and Jackson – who were at Google, DeepMind’s original parent company – had the advantage of access to certain proprietary resources.
“[Working backward from papers is] like navigating before we had GPS,” Lee, a research engineer at Google Brain, told TechCrunch via email. “The instructions speak of landmarks to see, how long to walk in a given direction, which turn to take at a critical point. There is enough detail for the experienced navigator to find his way, but unless you know how to read a compass you are hopelessly lost. You won’t retrace the steps exactly, but you’ll end up in the same place.”
The developers behind these initiatives, including Ahdritz and Jackson, say their projects not only help demonstrate whether the systems work as advertised, but also enable new applications and better hardware support. Systems from large labs and companies like DeepMind, OpenAI, Microsoft, Amazon, and Meta are typically trained on expensive, proprietary data center servers with far more processing power than the average workstation, further raising the barriers to open sourcing them.
“Training new variants of AlphaFold could enable new applications beyond protein structure prediction, which isn’t possible with DeepMind’s original code release because the training code was missing – for example, predicting how drugs bind to proteins, how proteins move, and how proteins interact with other biomolecules,” Ahdritz said. “There are dozens of high-impact applications that require training new variants of AlphaFold or integrating parts of AlphaFold into larger models, but the lack of training code prevents all of this.”
“This open-source effort does a lot to spread the ‘working knowledge’ of how these systems can behave in non-academic settings,” Jackson added. “The computational effort required to reproduce the original results [for AlphaZero] is quite high. I can’t remember the number off the top of my head, but it involved running about a thousand GPUs for a week. We were in a pretty unique position to be able to help the community try out these models, thanks to our early access to Google Cloud Platform’s TPU product, which was not yet publicly available.”
Implementing proprietary systems in open source is fraught with challenges, especially when there is little public information. Ideally, the code would be available alongside the dataset used to train the system and so-called weights, which are responsible for converting the data fed into the system into predictions. But that’s not often the case.
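To make that distinction concrete, here is a minimal toy sketch (all names and numbers hypothetical, not drawn from any real release): with the architecture code and released weights alone, anyone can reproduce predictions, but retraining for a new task also requires the update rule and training data – precisely the pieces reimplementations like OpenFold must reconstruct from papers.

```python
# Toy illustration (hypothetical model): what a weights-only release enables.

# A "released" model: architecture code plus trained weights.
released_weights = [0.5, -1.2, 0.8]  # stand-in for downloaded parameters

def predict(weights, features):
    """Inference: converts input data into a prediction (a weighted sum)."""
    return sum(w * x for w, x in zip(weights, features))

# With code + weights, anyone can run the model on new inputs...
prediction = predict(released_weights, [1.0, 2.0, 3.0])  # roughly 0.5

# ...but retraining (e.g., adapting to a new task) needs what labs often
# withhold: the training data and the training procedure itself.
def train_step(weights, features, target, lr=0.01):
    """One gradient-descent step on squared error -- the kind of detail
    that must be reverse engineered when training code isn't published."""
    error = predict(weights, features) - target
    return [w - lr * 2 * error * x for w, x in zip(weights, features)]

new_weights = train_step(released_weights, [1.0, 2.0, 3.0], target=1.0)
```

Without `train_step` (and the data to feed it), the released weights are a frozen snapshot: usable, but not extensible.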
For example, when developing OpenFold, Ahdritz and his team had to gather information from the official materials and reconcile the differences between various sources, including the source code, supplemental code, and presentations given by the DeepMind researchers early on. Ambiguities in steps such as data preparation and training code led to false starts, while a lack of hardware resources necessitated design compromises.
“We only have a handful of attempts to get this right, so it doesn’t drag on indefinitely. These things have so many computationally intensive stages that a tiny error can set us back so much that we’d have to retrain the model and regenerate a lot of training data,” said Ahdritz. “Some technical details that work very well for [DeepMind] don’t work as easily for us because we have different hardware… In addition, the ambiguity about which details are crucial and which were chosen without much thought makes it difficult to tweak anything, and ties us to whatever (sometimes awkward) decisions were made in the original system.”
So are the labs behind proprietary systems, like OpenAI, wary of having their work reverse engineered, or even used by startups to launch competing services? Apparently not. Ahdritz says the fact that DeepMind in particular is releasing so many details about its systems suggests it implicitly supports the effort, even if it hasn’t said so publicly.
“We have received no clear indication that DeepMind approves or disapproves of these efforts,” Ahdritz said. “But certainly, no one has tried to stop us.”
When big AI labs refuse to open source their models, the community steps in – TechCrunch