
By Kelly Mahoney

Growth Marketing Specialist at Brightcove

Auto-Captions: Limitations of Automated Speech Recognition


The rise of generative artificial intelligence (AI) has taken the world by storm, finding applications in personal and professional spheres alike. In the captioning industry, AI can be used in the process of automatic speech recognition (ASR), which converts speech to text. While ASR technology has never been more advanced than it is today, our research shows that even the best engines perform below industry standards. This means humans are still a mainstay in producing high-quality, accessible captions.

The Accuracy of Auto-Captions

In the captioning world, accuracy rates are used to gauge the precision and quality of a caption file, subtitle file, or transcript. Since accuracy is crucial to providing a truly equitable accommodation for d/Deaf and hard of hearing audiences, the industry standard for minimum acceptable caption accuracy is 99%. But what does this really mean?

When measuring the accuracy of an ASR engine, there are a variety of factors to consider. As outlined by the FCC, “Accurate closed captions must convey the tone of the speaker’s voice and intent of the content.” Proper spelling, spacing, capitalization, and punctuation are key elements of accurate captions, as are non-speech elements like sound effects and speaker identifications.

Because ASR engines are driven by artificial intelligence, their capabilities are limited by the data they were trained on. Despite continuing advancements, AI-powered technology doesn’t have the same capacity for logic or for understanding context that a human being does. In practice, relying solely on AI for transcription and captioning can produce spelling errors and inconsistencies that a human captioner would catch.

3Play Media’s 2024 report on the State of Automatic Speech Recognition evaluated the performance and accuracy of 10 engines in captioning and transcribing pre-recorded content.

We uncovered that some engines are better suited for particular content (e.g., educational versus cinematic), which adds nuance to possible use cases for auto-captions. But overall, zero out of 10 engines produced output measuring over 95% accuracy, when looking at Word Error Rate (WER). Using that same metric to analyze accuracy by content type, we see a spectrum of results. While the WER in the Goods and Services market is relatively low, it almost doubles in the Tech market.
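Word Error Rate compares an ASR transcript against a human-verified reference: the number of word-level substitutions, insertions, and deletions needed to turn one into the other, divided by the length of the reference. Below is a minimal sketch of that standard calculation; the exact normalization and scoring rules used in the report may differ, and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

reference = "captions must convey the tone of the speaker"
hypothesis = "captions must convey the tone of speaker"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # one dropped word out of eight
```

Note that a 95% accuracy rate (5% WER) still means roughly one error in every twenty words, which is why the industry standard of 99% remains out of reach for unassisted ASR.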

The discrepancy between different types of industry content demonstrates that ASR technology is still not independently sufficient to produce accessible captions. Ultimately, a human-in-the-loop approach to captioning offers the most potential for highly accurate output.

The Impact of Auto-Caption Inaccuracy

The repercussions of inaccurate captions may reach further than you think. People with disabilities and their families wield spending power in the billions, but their willingness to spend drops significantly when online experiences are inaccessible. With the 2023 WebAIM Million Report finding accessibility failures on over 96% of website home pages, this represents a real gap in potential revenue streams.

Not only do low-quality captions make content inaccessible, they can have a negative impact on your user experience across the board. The limitations of ASR make their transcripts more susceptible to substitution errors, hallucinations (text without audio basis), and formatting errors—which can confuse your audience and the algorithm. Further, video transcripts have an impact on SEO, which is an essential aspect of many brand marketing strategies.

Search engines rely on text associated with video content in order to index and rank results appropriately. This makes transcripts and caption files some of the strongest contributors to a site’s keyword density and relevant search rankings. If your brand relies solely on auto-generated subtitles and transcripts, errors could bog down your search strategy. Incorrect long-form queries and keywords create a disconnect between you, your target audience, and their engagement potential.

On top of the technical disadvantages, presenting poor-quality captions calls your whole brand into question. In the UK, 59% of consumers report that spelling errors and bad grammar would make them doubt the quality of services being offered. In other words, inaccurate captions undermine your marketing efforts and erode the confidence of your audience.

How to Use Auto-Captions Wisely

AI is an essential tool for creating auto-captions efficiently at scale. ASR-generated transcripts streamline captioning by providing a foundational first draft for human editors to review. This eliminates the need for manual timecode association, which is typically the most time-consuming part of caption production. The combination of professional human transcriptionists and technology thus makes for a more efficient quality assurance process, while keeping costs low for customers.

3Play’s patented process pairs human professionals with top-of-the-line technology to deliver transcripts and media accessibility services with an average measured accuracy of 99.6%. To make video accessibility easy, we integrate with popular video platforms like Brightcove, so the workflow lives where you already work. In addition to making content accessible and keeping up with compliance, the integration between 3Play and Brightcove increases the value of your video investment with one click.
