Operator 📞 - Hackathon
Treading the fine line.
Everyone hates those robocalls you get on your personal phone, but would it be just as bad if we called businesses?
Google Duplex seemed to think so: after a demo that amazed everyone, they canned the product. I wondered, however: if it’s fair for me to spend an hour on the phone with Telstra’s robots, wouldn’t it be fine if they were on the phone with my robot?
We discussed some moral, ethical, and legal issues with this, but concluded that it can’t be that bad and that we didn’t need to think about legal issues yet. It’s easier to ask for forgiveness than to ask for permission.
No code sample for this one… but we still got the slide deck!
What is it?
Businesses that rely on variable pricing schemes, where most customers pay the ‘sucker’ rate, are vulnerable to customers who spend the time researching a better deal. Our idea was to sell a service where everyone has all the research needed to ask for a better deal, and to take it a step further: you don’t even have to know the information at all… the service negotiates for you.
For this to work, we decided we needed to become the customer. In mid 2022 when we did this project, voice models were on the up and up, and we wanted in.
I devised a business model: customers would sign up with us, give us some money, grant us permission to act on their behalf, hand over their super-secret personal information, and record a long sample of their voice. Using this, we’d clone their voice with a TTS model and have it speak to the worker on the other side of the phone.
We had the information, we had a goal, we had the data, we had a voice, we knew what we needed to do.
We realized that mortgages were a prime target after hearing and reading anecdotal evidence of people calling their bank and just asking for their rates to be lowered. In many cases, this would just work, and the customers would receive a better rate. We toyed with the idea of more advanced negotiation techniques, such as asking for discharge forms, or mentioning other banks having lower rates, but for our POC we decided that a simple tool would be best to prove functionality.
How was it made?
The main tech challenges we faced were arguably pretty simple in theory and, separately, had each been done before.
For our MVP / demo we needed to do the following.
Phone requirements
- be able to call the bank
- listen to the other side of the phone
- interpret what they were saying
- come up with a response
- say it back
- conclude and hang up
The primary loop being: listen, interpret, generate, say.
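That loop can be sketched in a few lines of JavaScript. Everything here is a toy stand-in (the keyword match and canned reply are made up for illustration); the real listen/say stages are Twilio audio streams and the interpret stage is Google’s Speech to Text, covered below.

```javascript
// Toy sketch of the primary loop: listen, interpret, generate, say.
// Every stage here is a stub; the real versions (Twilio audio
// streaming, GCP Speech to Text, the IVR ruleset) come later.

const listen = chunk => chunk;                  // audio in (stubbed as text)
const interpret = audio => audio.toLowerCase(); // speech-to-text stand-in
const generate = text =>                        // made-up keyword rule
  text.includes("customer number") ? "My number is 12345678" : null;
const say = reply => reply;                     // audio out (stubbed)

function step(incomingChunk) {
  const heard = interpret(listen(incomingChunk));
  const reply = generate(heard);
  return reply ? say(reply) : null;             // silence: keep listening
}
```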
Voice requirements
- Clone our customers’ voice
- Inference faster than realtime
After thinking about that for a few minutes, with no knowledge of anything to do with phone call technology, we jumped straight in.
Phone
It was clear we needed programmatic access to a phone call. I took this on and looked into SIP and phone trunking software, but found it too obtuse; most of it seemed to be enterprise phone management for call centers. Soon after, I found Twilio and decided that was our best bet. Twilio offers API access for phone call management, including bi-directional audio streaming, which fit perfectly with what I wanted. At the time, the docs were a bit lacking, and it was quite difficult to get what I wanted out of it, but with a lot of tenacity I had a Python script that would call a phone number and play some audio.
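For context, when Twilio connects a call it fetches a callback URL and expects a TwiML document telling it what to do. A minimal sketch of building one that speaks a greeting and opens a bi-directional media stream back to the server (the URLs are placeholders; `<Connect><Stream>` is the TwiML verb for bi-directional streams, but double-check the current docs):

```javascript
// Sketch of the TwiML a call_instructions callback might return:
// speak a greeting, then open a bi-directional media stream back to
// our server. The URLs are placeholders, not a real deployment.
function callInstructions(greeting, streamUrl) {
  return [
    "<Response>",
    `  <Say>${greeting}</Say>`,
    "  <Connect>",
    `    <Stream url="${streamUrl}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// e.g. callInstructions("Hello World!", "wss://local.server/stream")
```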
The hardest component of this was audio streaming, not helped at all by the sub-par documentation (since upgraded). I hadn’t worked with websockets before, and this was a big leap. After a few problems in Python I ported the calling script over to NodeJS, given its better handling and wider support of websockets and streaming. The structure and order of the Twilio requests and sockets were something I had trouble with, so I’ve made a sequence diagram of the whole process further down the page.
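A rough sketch of what the server side of that websocket looks like: Twilio sends JSON frames over the socket, and `media` events carry base64-encoded 8kHz mu-law audio in `media.payload`. The handler and its return values here are my own illustration, not Twilio’s API.

```javascript
// Rough sketch of handling Twilio Media Stream frames. Twilio sends
// JSON messages; "media" events carry base64-encoded 8kHz mu-law
// audio in media.payload. Return values are purely illustrative.
function handleTwilioMessage(raw, onAudio) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "start":
      return "stream started";
    case "media":
      // hand the decoded audio chunk to the speech-to-text pipeline
      onAudio(Buffer.from(msg.media.payload, "base64"));
      return "media";
    case "stop":
      return "stream stopped";
    default:
      return "ignored";
  }
}
```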
The audio coming from the other end of the phone was streamed through another websocket to Google Cloud’s Speech to Text, and I ran the resulting transcript through a really simple Interactive Voice Response (IVR) system I created. A fun problem I ran into was that each successive word recognized would result in the whole sentence being returned. This had upsides and downsides: the system was really responsive as it picked up on words as they were said, but if words were repeated at all, it would get a little confused. Rather than spending the time upgrading the admittedly weak IVR, I actually had to complete a LeetCode-style problem in the form of a string-based Right Outer Join.
// right_outer_join("hello my name is", "my name is John") === "John"
/*
  Iterate backwards over the incoming transcript, building the missing words.
  this could be a lot cleaner, but it's 1:22am...
*/
function right_outer_join(entire_transcript, incoming_sentence) {
  let result = ""
  /*
    in iteration order:
    result =
    1. "n"
    2. Prepend "h"
    3. Prepend "o"
    4. Prepend "J"
    5. Return
  */
  let incoming_sentence_current = incoming_sentence
  while (!entire_transcript.includes(incoming_sentence_current)) {
    let i = incoming_sentence_current.length - 1
    result = incoming_sentence_current.charAt(i) + result
    incoming_sentence_current = incoming_sentence_current.substring(0, i)
  }
  // trim the leading space picked up between "is" and "John"
  return result.trim()
}
In the following excerpt from the IVR configuration, the system waits until it hears the keywords “customer” and “number”, responds with some natural speech, then repeats the user’s customer identification number. You’ll hear the audio later in the post.
{
  "Key": ["Customer", "number"],
  "Action": "wait",
  "Arg": "0.2"
}
{
  "Key": [],
  "Action": "say",
  "Arg": "Sure let me look that up",
  "URL": "./audio/sure_let_me_look_that_up.wav"
}
{
  "Key": [],
  "Action": "wait",
  "Arg": "1.3"
}
{
  "Key": [],
  "Action": "say",
  "Arg": "yep, my number is %CUSTOMER_NUMBER%",
  "URL": "./audio/my_client_number.wav"
}
I made this as simple as possible, accounting for only a few actions: waiting, pressing a key, waiting for a specific word, and responding with a wav file. I put some effort towards graph traversal and state recognition, but for a demo we only needed a happy path to show off.
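A toy interpreter for steps in that shape might look like this. The step fields (`Key`, `Action`, `Arg`, `URL`) mirror the config excerpt; the executor and its return values are hypothetical, just to show how the keyword-gated actions fit together.

```javascript
// Toy interpreter for one IVR step. Steps with a non-empty "Key"
// wait until every keyword has appeared in the transcript; "wait"
// pauses, and "say" plays a pregenerated wav (stubbed here as
// returning the file URL). The step shape mirrors the config above;
// the executor itself is a sketch, not the original implementation.
function nextAction(step, transcript) {
  const heard = transcript.toLowerCase();
  const keysMatched = step.Key.every(k => heard.includes(k.toLowerCase()));
  if (!keysMatched) return { type: "listen" };            // keep waiting
  switch (step.Action) {
    case "wait":
      return { type: "wait", seconds: parseFloat(step.Arg) };
    case "say":
      return { type: "play", file: step.URL };
    default:
      return { type: "listen" };
  }
}
```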
Voice
For the voice component, my teammate looked into voice-cloning models, but the training times were too long and running them was expensive. We knew it was certainly possible, but it wasn’t within the scope of the 48 hours we had for this project. At this point we were aiming for a conversation at least, but he found that inference was too slow on our available resources. We quickly changed our tune and modified our plan.
Voice requirements
- ~~Clone our customers’ voice~~
- ~~Inference faster than realtime~~
- Pregenerate voice samples to play
With tortoise-tts (using the Tom Hanks voice, naturally) he generated a series of audio snippets answering the questions we expected a bank to ask. The quality was questionable, but it perfectly conveyed the intent of our design.
Tom Hanks’ Client Number
The Demo
I don’t think the video was ever saved, but we had a perfect demonstration live on stage in front of the judges. We handed them one of our phones, gave them a copy of the ‘script’, and told them they were now a Customer Service Rep at ANZ. Laptop plugged into the main screen, a quick Enter in the VS Code terminal, and the phone started to ring.
You’ve made it this far, have a look at the slide deck!
Final output
I wish the Twilio docs had one of these…
sequenceDiagram
participant S as Server
participant T as Twilio
actor R as Recipient
S->>T: Initiating Phone Call -> Recipient <br> Callback: local.server/call_instructions
T->>S: Requests local.server/call_instructions
S->>T: Contents of local.server/call_instructions
Note over S,T: call_instructions: <br> Say: Hello World! <br> StreamURL: wss://local.server/stream
T->>R: Initiates call
R->>T: Accepts call
T->>R: TTS: "Hello World!"
T->>S: Initiate websocket connection
loop Until phone call is complete
R->>S: Incoming audio via Websocket
S-->>S: Get transcript via GCP
Note right of S: Using Google Cloud Speech to Text
S-->>S: Apply ruleset <br> Retrieve audio response
S->>R: Audio stream via Websocket
end
S->>T: End call
T->>R: End call