Amazon Polly - Scaler Topics

Overview

AWS offers numerous cloud services for handling the large volumes of data generated today. Amazon Polly is one of them: a service that turns text into lifelike speech, allowing you to create applications that talk and to build entirely new categories of speech-enabled products. With dozens of lifelike voices across many languages, you can build speech-enabled applications that work in many countries.

What is Amazon Polly?

For a long time, people expected machine-generated speech to sound precise, clipped, and devoid of human emotion. Today, computer-generated speech, known as Text-to-Speech (TTS), powers many applications and use cases. Public announcement systems, entertainment, telephony, assistive apps, gaming, and e-learning are just a few starting points.

To address these use cases, Amazon introduced its Text-to-Speech (TTS) service, Amazon Polly. Amazon Polly is a cloud service that converts text into humanlike speech, which you can then use with your own tools, content, and applications. It allows you to create applications that talk and to build entirely new categories of speech-enabled products. With dozens of humanlike voices across many languages, you can build speech-enabled applications that work in many countries. Amazon Polly offers a total of 47 male and female voices spread across 24 languages, with additional languages and voices on the roadmap.

In addition to the Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices, which use a newer machine learning approach to deliver improved voice and speech quality. Amazon Polly’s Neural TTS technology also supports the Newscaster speaking style, which caters to news-related use cases.

Below are a few of the voices that Amazon Polly supports: [Figure: Amazon Polly voices]

Finally, with Amazon Polly Brand Voice, you can create a custom voice for your company, which can help you connect with your employees and offer a more customized engagement. To do this, you work with the Amazon Polly team to build an NTTS voice for the exclusive use of your organization.

Features of Amazon Polly

Now let us look at the features Amazon Polly provides to its users:

Enhancing the Visual Experience by Synchronizing the Speech: With Amazon Polly, you can request an additional stream of metadata with information about the specific sentences, words, and sounds being pronounced. Combining this metadata stream with the synthesized speech audio stream lets you build applications with an enhanced visual experience, such as speech-synchronized facial animation or karaoke-style word highlighting.
Easy-to-Use API: Amazon Polly provides an API that lets you quickly integrate speech synthesis into your application. You send the text that you want to convert into speech to the Amazon Polly API, and Amazon Polly returns the audio stream to your application. The application can then play the stream directly or store it in a standard audio file format, such as MP3.
Platform and Programming Language Support: Amazon Polly supports all the programming languages included in the AWS SDK (such as Java, Ruby, Go, Python, C++, Node.js, .NET, and PHP), along with the AWS Mobile SDK (iOS/Android). Because Amazon Polly also exposes an HTTP API, you can implement your own access layer.
Maximum Duration for Speech is Adjustable: Amazon Polly can simplify the dubbing process. Suppose your application published a training video with embedded French speech and you now want to localize it into German. You first translate the text using Amazon Translate and then voice it with Amazon Polly. The localized German speech must fit the corresponding frames of the video, so it cannot be longer than the French speech. Amazon Polly’s 'time-driven prosody' feature automatically adjusts the speech rate to fit within a maximum duration that you define, which is useful for use cases like localization.
Vast Selection of Voices and Languages: With dozens of humanlike voices and support for a variety of languages, you can select the ideal voice and distribute speech-enabled content across the globe. Amazon Polly offers Standard TTS voices along with Neural Text-to-Speech (NTTS) voices, which improve speech quality for more natural, humanlike output.
Streaming Audio can be Optimised: Amazon Polly streams audio back to your application in near real time, so you can deliver all kinds of information to your end customers as it is synthesized. You can choose from a range of sampling rates to balance bandwidth and audio quality for your content and application. Amazon Polly supports the MP3, Vorbis, and raw PCM audio stream formats.
Adjust Speech Rate, Pitch, Speaking Style, and Loudness: Amazon Polly supports Speech Synthesis Markup Language (SSML), a W3C-standard, XML-based markup language for speech synthesis applications, including the common SSML tags for phrasing, emphasis, and intonation. Custom Amazon SSML tags provide additional options, such as the ability to make certain voices speak in a Newscaster style. This capability helps you create humanlike speech that attracts, engages, and holds the attention of your customers.
Custom Lexicons: With Amazon Polly’s custom lexicons (vocabularies), you can modify the pronunciation of specific words, such as company names, foreign words, acronyms, and neologisms (for example, “ROTFL”, or “C’est la vie” when spoken in a non-French voice). You customize the pronunciation by uploading an XML file with lexical entries.

Brand Voice: Amazon Polly Brand Voice is a custom engagement in which you work with the Amazon Polly team to build a Neural Text-to-Speech (NTTS) voice for the exclusive use of your organization. A Brand Voice lets you differentiate your products and applications with a unique vocal identity across a wide variety of use cases, such as Amazon Connect and Alexa Skills integrations. The Amazon Polly team supports you throughout the process: defining the persona, identifying the voice actor or actress, recording their speech, and training a model to reproduce the voice. The resulting voice is then made available to your AWS account ID(s).
Speech Synthesis via API, Console, or Command Line: You can access Amazon Polly’s speech synthesis through the Polly API (and the various language-specific SDKs), the AWS Management Console, or the AWS command-line interface (CLI). All three give you full control over Amazon Polly’s capabilities.
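
The custom lexicon feature above relies on an XML file of lexical entries. The following is a minimal sketch of generating such a file, assuming the W3C Pronunciation Lexicon Specification (PLS) layout that Amazon Polly lexicons are based on. The `build_lexicon` helper and the sample entry are hypothetical and exist only for illustration.

```python
from xml.sax.saxutils import escape

# Namespace defined by the W3C Pronunciation Lexicon Specification (PLS).
PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS document that maps written forms to spoken aliases.

    `aliases` maps a grapheme (the text as written) to the phrase that
    should be spoken in its place. This helper is hypothetical.
    """
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>\n"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


# Illustrative entry: expand an acronym when it is read aloud.
lexicon_xml = build_lexicon({"ROTFL": "rolling on the floor laughing"})
print(lexicon_xml)
```

With the AWS SDK for Python (boto3), a document like this would typically be uploaded through the Polly client's `put_lexicon` call and then referenced by name in the `LexiconNames` parameter of a synthesis request.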

Benefits of Amazon Polly

Listed below are five major benefits of Amazon Polly that set it apart from other text-to-audio converter tools on the market:

Real-time streaming: If you need consistently fast response times to deliver humanlike voices in a conversational user experience, Amazon Polly can stream your content as it is synthesized. When you send text to Amazon Polly’s API, it returns the audio to your application as a stream, so you can play the voices immediately.

Low cost: With Amazon Polly’s pay-as-you-go pricing, you pay a low cost per character converted and get unlimited replays. This makes Amazon Polly a cost-effective way to voice your content for your audience.

Control and customise the speech output: With Amazon Polly, you can modify voices to best suit your needs. Because Amazon Polly supports lexicons and SSML tags, you can control aspects of speech such as pitch, speech rate, volume, and pronunciation.

Humanlike voices: Amazon Polly provides dozens of languages with a wide selection of natural-sounding male and female voices. This makes the audio for your converted text sound human and gives your customers a more conversational experience with your content. Amazon Polly’s fluid pronunciation helps deliver high-quality voice output for a global audience.

Capture and redistribute speech: Amazon Polly allows unlimited replays of already generated speech at no additional fee, so you can convert, store, and redistribute speech to help your content reach a wider audience. You can create speech files in standard formats like MP3 and OGG, and then serve them to your customers from the cloud or locally, with apps or devices, for offline playback.

[Figure: The benefits that Amazon Polly offers to its customers]
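
Because generated speech can be stored and replayed at no additional fee, an application can keep each synthesized clip and reuse it instead of requesting it again. The sketch below illustrates one possible on-disk cache. The `cache_key` and `get_audio` helpers are hypothetical, and `fake_synthesize` stands in for a real call to Amazon Polly so that the example runs without AWS credentials.

```python
import hashlib
import os
import tempfile


def cache_key(text, voice_id, output_format):
    """Derive a stable file name from everything that affects the audio."""
    raw = f"{voice_id}|{output_format}|{text}".encode("utf-8")
    return f"{hashlib.sha256(raw).hexdigest()}.{output_format}"


def get_audio(text, voice_id, output_format, synthesize, cache_dir):
    """Return audio bytes, calling `synthesize` only on a cache miss."""
    path = os.path.join(cache_dir, cache_key(text, voice_id, output_format))
    if not os.path.exists(path):
        audio = synthesize(text, voice_id, output_format)
        with open(path, "wb") as audio_file:
            audio_file.write(audio)
    with open(path, "rb") as audio_file:
        return audio_file.read()


# Stand-in for a real synthesis call; it records how often it is invoked.
calls = []


def fake_synthesize(text, voice_id, output_format):
    calls.append(text)
    return b"fake-audio-bytes"


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_audio("Welcome back.", "Joanna", "mp3", fake_synthesize, cache_dir)
    second = get_audio("Welcome back.", "Joanna", "mp3", fake_synthesize, cache_dir)

print(len(calls))  # the second request is served from the cache
```

Including the voice and output format in the key is a design choice: the same text synthesized with a different voice or format is a different audio file and should not share a cache entry.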

How Does Amazon Polly Work?

Now let us understand how Amazon Polly works when it is integrated with an application, and how it delivers the benefits and features described earlier.

As stated earlier, Amazon Polly converts input text into humanlike speech. When you call one of the speech synthesis methods, you provide the text that you want to synthesize, choose one of the Neural Text-to-Speech (NTTS) or Standard Text-to-Speech (TTS) voices, and specify the audio output format. Amazon Polly then synthesizes the text into a high-quality speech audio stream in the voice you selected.
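
The three choices described above (text, voice, and output format) can be sketched as a small request builder. This is an illustrative sketch rather than a definitive implementation: `build_synthesis_request` and `synthesize_to_file` are hypothetical helpers, the voice `Joanna` is only an example, and the network call assumes the AWS SDK for Python (boto3) is installed and credentials are configured. boto3 is imported lazily so that the request-building part runs on its own.

```python
ENGINES = {"standard", "neural"}
OUTPUT_FORMATS = {"mp3", "ogg_vorbis", "pcm"}


def build_synthesis_request(text, voice_id="Joanna", engine="neural",
                            output_format="mp3"):
    """Collect the parameters for a speech synthesis request."""
    if engine not in ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": output_format,
    }


def synthesize_to_file(request, path):
    """Send the request to Amazon Polly and save the returned audio stream."""
    import boto3  # third-party SDK, assumed installed; needs AWS credentials

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**request)
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


request = build_synthesis_request("Hello from Amazon Polly.")
print(request)
# synthesize_to_file(request, "hello.mp3")  # uncomment with credentials set up
```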

The diagram below shows one way to integrate Amazon Polly: [Figure: Amazon Polly integration]

Let's look more closely at how Amazon Polly converts text to speech: you supply the input text, choose a Standard (TTS) or Neural (NTTS) voice, and specify the audio output format (MP3, OGG, and so on); Amazon Polly then converts the text to speech.

The input text: For Amazon Polly to return an audio stream, you first provide the text that you want to synthesize. The input can be either plain text or Speech Synthesis Markup Language (SSML). With SSML, you can control various aspects of speech, such as pronunciation, volume, pitch, and speech rate.
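
To illustrate the SSML input option, the sketch below wraps plain text in the standard SSML `<speak>` and `<prosody>` elements to set speech rate, pitch, and volume. The `to_ssml` helper is hypothetical, and not every attribute is necessarily available for every voice type, so treat the defaults as assumptions to verify against the voice you use.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text, rate="medium", pitch="medium", volume="medium"):
    """Wrap plain text in SSML that sets rate, pitch, and volume."""
    # Escape &, <, and > so that the text cannot break the markup.
    body = escape(text)
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>{body}</prosody></speak>"
    )


ssml = to_ssml("Fish & chips are ready.", rate="slow", volume="loud")
print(ssml)
```

When SSML is sent through the AWS SDK for Python (boto3), the synthesis request also sets `TextType` to `"ssml"` so that the markup is interpreted rather than read aloud.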

The available voices: Amazon Polly offers a wide portfolio of languages and voices, including a bilingual voice (for both English and Hindi), so you can choose exactly the voice you want. You can also select between male and female voices to make the audio more relevant to your product. When you launch a speech synthesis task, you specify the voice ID, and Amazon Polly uses that voice to convert the text to speech.

Note that Amazon Polly is not a translation service: the synthesized speech is in the same language as the input text. However, when the text is in a different language from the one designated for the voice (for example, German text with a Hindi voice), or when numbers are written as digits (for example, 53 rather than fifty-three), they are synthesized in the language of the voice and not the language of the text.

The output format: Amazon Polly can deliver synthesized speech in several formats, so you can select the one that suits your needs. For example, you might request speech in the MP3 or Ogg Vorbis format for consumption by web and mobile applications, or in the PCM (pulse-code modulation) format for consumption by AWS IoT devices and telephony solutions.
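
The format guidance above can be sketched as a lookup from the kind of consumer to an output format, with an optional sample rate check. The `output_settings` helper and the consumer labels are hypothetical. The table of sample rates reflects the values listed for the SynthesizeSpeech API at the time of writing and should be treated as an assumption to verify against the current documentation.

```python
# Output format per consumer, following the guidance in the text above.
FORMAT_BY_CONSUMER = {
    "web": "mp3",
    "mobile": "mp3",
    "iot": "pcm",
    "telephony": "pcm",
}

# Sample rates (in Hz, passed as strings) assumed valid for each format.
VALID_SAMPLE_RATES = {
    "mp3": {"8000", "16000", "22050", "24000"},
    "ogg_vorbis": {"8000", "16000", "22050", "24000"},
    "pcm": {"8000", "16000"},
}


def output_settings(consumer, sample_rate=None):
    """Return the output-related request parameters for a consumer type."""
    output_format = FORMAT_BY_CONSUMER[consumer]
    settings = {"OutputFormat": output_format}
    if sample_rate is not None:
        if sample_rate not in VALID_SAMPLE_RATES[output_format]:
            raise ValueError(
                f"{sample_rate} Hz is not supported for {output_format}"
            )
        settings["SampleRate"] = sample_rate
    return settings


print(output_settings("telephony", "8000"))
print(output_settings("web"))
```

Leaving the sample rate unset lets the service apply its default for the chosen voice, which is often the simplest option unless bandwidth is constrained.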

Use Cases of Amazon Polly

Now that we have covered the benefits and features of Amazon Polly, let us explore some of its use cases:

Online Learning: Amazon Polly lets developers provide an enhanced visual experience in their applications, such as speech-synchronized facial animation or karaoke-style word highlighting. You can request an additional stream of metadata with information about the specific sentences, words, and sounds being pronounced. Together with the synthesized speech audio stream, this metadata stream lets you animate avatars and highlight text in your application as it is spoken. For example, an application can read a lesson aloud while highlighting the spoken text.

The diagram below shows how Amazon Polly supports online learning: [Figure: Amazon Polly online learning]
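
The word highlighting described for online learning can be sketched with the speech metadata stream, which Amazon Polly returns as one JSON object per line. The sample data below is illustrative rather than real service output, and the `parse_speech_marks` and `word_at` helpers are hypothetical.

```python
import json


def parse_speech_marks(raw):
    """Parse line-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def word_at(marks, elapsed_ms):
    """Return the word being spoken at `elapsed_ms`, or None before speech.

    Assumes the marks are ordered by time, as they appear in the stream.
    """
    spoken = [
        mark for mark in marks
        if mark["type"] == "word" and mark["time"] <= elapsed_ms
    ]
    return spoken[-1]["value"] if spoken else None


# Illustrative metadata: `time` is the offset in milliseconds from the start
# of the audio, and `start`/`end` locate the word in the input text.
sample = "\n".join([
    '{"time": 6, "type": "word", "start": 0, "end": 4, "value": "Mary"}',
    '{"time": 373, "type": "word", "start": 5, "end": 8, "value": "had"}',
    '{"time": 604, "type": "word", "start": 9, "end": 10, "value": "a"}',
])

marks = parse_speech_marks(sample)
print(word_at(marks, 400))
```

A player would call `word_at` with the current playback position on each screen refresh and highlight the returned word. With the AWS SDK for Python (boto3), this metadata is requested by setting `OutputFormat` to `"json"` and listing the wanted mark types in `SpeechMarkTypes`.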

Content Creation: Audio is widely used as a complementary medium for written and visual communication. By voicing your content with Amazon Polly, you give your audience an alternative way to consume information and help your content reach a larger audience. You can also generate speech in dozens of languages, which makes it easy to add speech to applications with a global audience, such as feeds, websites, or videos. For example, you can convert website blog posts to speech and offer them as MP3 downloads.

The diagram below shows how Amazon Polly supports content creation: [Figure: Amazon Polly content creation]

Telephony: Amazon Polly enables contact centers to engage customers with natural-sounding voices. You can cache and replay Amazon Polly’s speech output to prompt callers through interactive voice response (IVR) systems, such as Amazon Connect. By additionally leveraging Amazon Polly’s API, you can deliver automated real-time information such as service status, account and billing inquiries, addresses, and contact information. For example, you can enable text-to-speech in a telephony system.

The diagram below shows how Amazon Polly supports telephony: [Figure: Amazon Polly telephony]

Amazon Polly Pricing

With Amazon Polly, you pay only for what you use. You are charged based on the number of characters of text that you convert, either to speech or to Speech Marks metadata. You can also cache and replay Amazon Polly’s generated speech at no additional cost, and you can get started with the Amazon Polly Free Tier.

PAY-AS-YOU-GO MODEL

You are billed monthly for the number of characters of text processed. Amazon Polly’s Standard voices are priced at $4.00 per 1 million characters for speech or Speech Marks requested (once you are outside the free tier). Amazon Polly’s Neural voices are priced at $16.00 per 1 million characters for speech or Speech Marks requested (once you are outside the free tier).

The free tier for Amazon Polly’s Standard voices includes 5 million characters per month for speech or Speech Marks requests for the first 12 months, starting from your first request for speech. For Neural voices, the free tier includes 1 million characters per month for speech or Speech Marks requests for the first 12 months, starting from your first request for speech.
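
The per-character billing and free tier can be turned into a rough monthly estimate. The sketch below assumes rates of $4.00 (Standard) and $16.00 (Neural) per 1 million characters and the free tier allowances described above. Prices change over time, so treat both rates as assumptions and check the current AWS pricing page. The `monthly_cost` helper is hypothetical.

```python
# Assumed list prices in USD per 1 million characters; verify before relying on them.
RATE_PER_MILLION_USD = {"standard": 4.00, "neural": 16.00}

# Free tier allowance in characters per month (first 12 months only).
FREE_TIER_CHARACTERS = {"standard": 5_000_000, "neural": 1_000_000}


def monthly_cost(characters, engine, free_tier_active=False):
    """Estimate the monthly charge in USD for the characters processed."""
    free = FREE_TIER_CHARACTERS[engine] if free_tier_active else 0
    billable = max(0, characters - free)
    return billable * RATE_PER_MILLION_USD[engine] / 1_000_000


# 2 million characters with Neural voices, after the free tier has ended.
print(monthly_cost(2_000_000, "neural"))
# The same volume while the free tier still applies.
print(monthly_cost(2_000_000, "neural", free_tier_active=True))
```

Replaying cached audio does not add to the character count, so only text that is actually sent for synthesis needs to be included in the estimate.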

Below are a few pricing examples showing what you are charged once you start using Amazon Polly: [Figure: Amazon Polly pricing examples]

Companies Using Amazon Polly

Now that we understand Amazon Polly and its features, benefits, and pricing options, let us look at some companies that already use it to convert text to speech and make their content more engaging for their customers.

Some of the well-known companies that use Amazon Polly are described below:

The Washington Post: As part of its commitment to audio and rapid experimentation, The Washington Post has integrated Amazon Polly into its publishing ecosystem. This offers readers a powerful convenience feature at scale while maintaining a high-quality, consistent audio experience across all its platforms for subscribers and readers.

Twilio: By integrating with Amazon Polly, Twilio added more than 50 voices, 25 languages, and new APIs, giving developers more control over synthesized speech output in their Programmable Voice applications. Natural-sounding, humanlike voices, customization via SSML, and the reliability needed to run a system at scale met the requirements Twilio was trying to address. Twilio customers can now increase automation while maintaining or increasing their own customer engagement.

PolicyBazaar: To advance its growth, PolicyBazaar integrated Amazon Polly with its in-house IVR calling service and with PBee Connect for voice broadcasting, critical voice alerts, and inbound calls.

BeyondWords: BeyondWords built its initial product exclusively on Amazon Polly voices, which helped it meet its high standards for quality and interoperability and remain a popular choice among its customers.

Duolingo: Accurate pronunciation matters when teaching a language, so Duolingo integrated its platform with Amazon Polly voices that are not just high in quality but as good as natural human speech. This gives its learners more engaging content and helps grow its business.

GoAnimate: By integrating Amazon Polly with its platform, GoAnimate can immediately give voice to the characters it animates. This helps in scenarios where live human voice-over is either resource- or time-prohibitive, for example, when developing a video in many languages or speeding up the approval process during pre-production.

Vodafone New Zealand: Vodafone launched Amazon Polly’s new Kiwi voice in its call center, where it now answers millions of customer calls each month. Bringing a unique New Zealand identity into its customer service channels helped Vodafone expand its possibilities.

VMware: With Amazon Polly, VMware can reach differently abled partners and customers by programmatically creating content, which supports its goals of inclusivity and enablement.

Mapbox: Clear, well-understood voice guidance is critical to the user experience of Mapbox’s navigation solution. Mapbox uses Amazon Polly’s Text-to-Speech service because it offers natural-sounding pronunciation with highly intelligible and pleasant voices in consumers' preferred languages.

National Australia Bank: National Australia Bank (NAB) uses Amazon Connect to improve the experience of customers who contact its call centers. It also uses Amazon Polly Brand Voice to create a voice that feels both uniquely NAB and consistent with its positioning and with what customers want to experience when they call. This voice-first digital innovation helps NAB deliver the kind of engagement it wants its customers to have.

Conclusion

Some key takeaways from the article are as follows:

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk and to build entirely new categories of speech-enabled products.
With dozens of lifelike voices across many languages, you can build speech-enabled applications that work in many countries.
With Amazon Polly, you pay only for what you use: you are charged based on the number of characters of text that you convert (either to speech or to Speech Marks metadata), and you can cache and replay Amazon Polly’s generated speech at no additional cost.
You can modify voices to best suit your needs. Because Amazon Polly supports lexicons and SSML tags, you can control aspects of speech such as pitch, speech rate, volume, and pronunciation.
Unlimited replays of already generated speech at no additional fee let you convert, store, and redistribute speech to help your content reach a wider audience.
Amazon Polly supports all the programming languages included in the AWS SDK (such as Java, Ruby, Go, Python, C++, Node.js, .NET, and PHP), along with the AWS Mobile SDK (iOS/Android).