Offline AI voice restoration device for people with voice disorders

Description of proposed project

Many people have medical conditions that limit their vocal ability, for example leaving them too hoarse or too quiet to be understood. Causes of dysphonia range from trauma to the vocal cords (from cancer, benign tumours, damage caused by breathing tubes during surgery, etc.) to neurological conditions such as motor neuron disease or Parkinson's. Currently, the main assistive technologies available are voice amplification, which increases the volume of a voice but does nothing to improve its quality, and text-to-speech apps, which rely on slow typing or pre-defined phrases. Neurological disorders also often impair motor function, making text-to-speech apps difficult or impossible to use. Many people with dysphonia are older and less tech-literate, which creates a further barrier to app-based assistive tools.

I am building a portable speech-restoration device that uses AI to convert a user's degraded voice into a clear, natural-sounding version of their original voice or a surrogate voice. Recent advances in voice-to-voice conversion models and mobile computing hardware make live, portable voice restoration possible for the first time. The aim is real-time conversion, but even a delay of a few seconds could be life-changing for many people, such as someone who is physically unable to operate a text-to-speech app. The device needs to be portable and to run offline rather than relying on a server, because of latency, privacy, unreliable internet access, and API costs.

It will consist of a satchel-sized computing unit, a small speaker, and a head-mounted microphone. It can be worn on a strap or attached to a user's wheelchair or bed. From the user's perspective, the device will be simple: switch it on and it is ready to go. Any configuration will be carried out as part of the user's appointments with medical professionals. For now, cutting-edge hardware is required to fit enough computing power into something battery-powered; as computing hardware continues to improve, future iterations will become smaller and have longer battery life. Even if, say, it becomes possible in a few years to run voice-to-voice conversion on ordinary smartphone hardware, there will still be demand for tailored medical hardware. I expect the first retail units to sell for a few thousand pounds or dollars, which is reasonable for a specialised medical assistive device.

I have a working proof of concept on non-mobile hardware that performs live conversion (~0.4 s latency) of my own dysphonic speech, which is caused by chronic lumps on my vocal cords. The next step is to turn this into a portable prototype, develop the design, and produce early evidence with clinicians and patients. I already have an enthusiastic clinical collaborator, who gives me access not only to vital expertise but also to patients and existing care pathways within the UK's National Health Service (NHS). The UK is my first target market because it is where I live, but there is worldwide demand for this technology.

As my clinical collaborator has been at pains to emphasise to me, this would revolutionise speech and language therapy and change the daily lives of an underserved section of the population. Around 5 million people in the UK have some form of dysphonia; even if only 5 percent were good candidates, that gives a conservative estimate of about 250,000 potential users in the UK alone, with millions more worldwide who could benefit. I also strongly suspect that version 2.0 of this device could be adapted to work with a modified electrolarynx, restoring natural-sounding speech to people whose vocal cords have been entirely removed, for example after cancer-related surgery.

Here is the audio I included with my ACX Grant application, showcasing the technology working on my own dysphonic voice. I wish I could include a sample more tailored to this pitch, but unfortunately writing this has coincided with a house move, and the PC I have been running the model on is in storage.

Why are you qualified to work on this?

I will be this device's first user. Due to a rare condition, adult-onset recurrent respiratory papillomatosis (AO-RRP), I have lumps on my vocal cords that leave my voice quite degraded. I already use a portable voice amplifier. In the future, I may be reduced to a whisper. For my own sake, I wish this device already existed. I am incredibly motivated to make this project a reality, and I appreciate more than ever how many people have it worse than me.

I see my role in this project as demonstrating the viability of this new class of assistive device and shepherding it to production. I have enough background in tinkering with computers to know the right questions to ask, and when to seek outside expertise. I have enough technical expertise to build a portable prototype of this system that demonstrates its viability. I will do this with open-source models and by modifying off-the-shelf hardware (e.g. shucking a gaming laptop for its mobile GPU so I don't need to design a cooling solution from scratch for a bare MXM module). The $25k ACX Grant already offered to me will enable me to build a portable prototype and establish core business operations. With additional funds, I will be able to bring in more ML expertise, further tailor the model, collect clinical evidence, and, further down the line, work towards production, certification, and distribution.

As fate would have it, one of my oldest friends is a specialist speech and language therapist in the UK's National Health Service. She is incredibly enthusiastic about this project, and is just the sort of person I need on side for clinical testing, credibility, and design iteration, drawing on her expertise and her access to patients. We have begun planning early patient questionnaires to feed into prototyping. I am actively scouting for collaborators with more ML experience than me, but I am capable of getting this to the portable prototype stage and beginning patient testing.

What would you do if not funded?

With the $25k offered to me by the ACX Grant, as well as £5k I have received via Aberystwyth University (UK), I have about six months of runway to set up the fundamentals of a company and build a portable, offline prototype. The goal of the prototype is to attract further funding, either public or private. Additional funding at this stage would mean less time spent seeking money and more time making this thing a reality.

How much money do you need?

Additional funds beyond what I have already secured would give the project more security and would enable, among other things:

- More comprehensive early clinical engagement to inform early prototyping, e.g. patient questionnaires.

- More comprehensive later clinical engagement: getting prototypes into the hands of patients, feeding their insights into the design, and building an evidence base. (My clinical partner is perfectly placed to facilitate or lead this and the previous point.)

- Assembling more prototypes by purchasing additional hardware. This would be highly useful for design iteration, as well as for facilitating patient testing and evidence gathering.

- Collecting data for further model training. For example, vocal cord trauma and motor neuron disease produce different forms of speech impairment. By training on paired vocal samples, the model can be made to work better across a variety of conditions. I do not yet know whether a single model can serve a wide range of conditions well, or whether particular conditions, or classes of conditions, will need to be targeted separately. For instance, neurological and physical conditions may need to be treated differently because of features such as the severity of slurring. With training data, development, and testing, questions like these can be answered.

- Hiring additional ML expertise to further develop the available open-source models, in particular to improve latency and to tailor the model(s) to respond better to a wider variety of vocal signatures. Generic voice-to-voice conversion models can already help many people, but this tailoring would widen their clinical applicability, because existing models have been trained to expect typical speech. (Even so, they already perform better than I had imagined before starting this project!)

- Getting closer to a production-ready design.

I am incredibly grateful for the funds already offered to this project by the ACX Grant, which have completely changed what is possible at this stage. Nonetheless, I will need to seek further funding sooner or later. If I gain access to more money sooner, this project will have a better chance of succeeding, and I will have to spend less time applying for funding. Receiving the maximum amount I have asked for would make all of the above possible, but any amount between $25k and that upper bound would help bring this project closer to success. The full road to production will require more than even $100k, but I have set my upper bound at this amount in recognition that the project is still at an early stage. If someone is interested in funding at an even higher level, I would be very happy to discuss this further.

Thank you for your time.