The Unofficial Guide to OpenAI Realtime WebRTC API

Original link: https://webrtchacks.com/the-unofficial-guide-to-openai-realtime-webrtc-api/

webrtcHacks' Chad Hart details how he adapted the OpenAI Realtime API to run on a Raspberry Pi with an old Google AIY Voice Kit. The project uses WebRTC for audio, replacing Dialogflow. He provides a step-by-step guide covering the key pieces: capturing microphone audio with `getUserMedia`, setting up an `RTCPeerConnection`, and managing the data channel used for event exchange. He uses the "implicit setLocalDescription style" and sends the local description SDP in the body of an HTTP POST, with the API token in the Authorization header. The article stresses listening for the `connectionstatechange` event to ensure a stable connection. It explains how to interact with the Realtime API over data channel messages, including `session.update` to set session parameters and `response.create` for an initial greeting. Function calling enables actions such as ending the session, and the article covers handling events like `response.function_call_arguments.done`.

From the Hacker News discussion, Sean-Der commented: "If anyone has questions about WebRTC + the Realtime API, I am happy to answer. I am especially excited about the IoT/embedded side of it :) You can see a video demo of it here [0]. I think the code on microcontrollers is very approachable [1]." [0] https://youtu.be/14leJ1fg4Pw?t=824 [1] https://github.com/espressif/esp-webrtc-solution


    OpenAI using WebRTC in its Realtime API is obviously exciting to us here at webrtcHacks. Fippo and I were doing some blackboxing of this based on a quick sample the day of the WebRTC announcement so we could look at it in webrtc-internals and Wireshark. Some weeks later, my daughter was interested in using ChatGPT voice mode. That reminded me of the many old cardboard Google AIY Voice Kits I made with her. Could these be repurposed to use the new OpenAI Realtime API instead of Dialogflow?

    Adapting that OpenAI Realtime WebRTC sample to run on a Raspberry Pi was actually pretty simple. However, as I got beyond the basic “say hello to the bot”, I noticed the documentation for using the Realtime API with WebRTC and Data Channels was not great. My initial attempt to vibe code something quick failed despite my Retrieval-Augmented Generation (RAG) attempts with OpenAI’s Realtime docs. This failure forced me to do some old-school thinking and experimentation for myself. I ended up coding this the traditional way with logs, network inspection, and trial and error. That ultimately led me to this post to document it all.

    In this post, I will:

    • Provide a step-by-step guide for using the OpenAI Realtime API with WebRTC
    • Highlight some best practices for working with WebRTC along the way
    • Comment on the messages and flows
    • Highlight how to use functions with the Realtime API’s data channel

    Many thanks to pion founder and past webrtcHacks interviewee, Sean DuBois of OpenAI for his review and suggestions! I will be asking him to chat more about this.

    If you want to see what this is before reading, try it below. You must click the Settings button and paste your OpenAI key for this to work. (ToDo: better warnings in the app on that).

    This app is just an iFrame hosted from the webrtcHacks/aiy-chatgpt-webrtc repo, but if you are not comfortable entering your credentials here, then clone the repo or copy & paste the code from the source into an HTML file and open it.

    You can see my final example code on the repo.

    Some notes on this before we begin:

    • It is a single file example – just HTML with some vanilla JavaScript inside <script> tags
    • This is a simple, but realistic example
    • I broke this into 2 <script> tags
      • We’ll focus on the first with the WebRTC, API setup, and messaging logic
      • The 2nd is mostly UI logic – we won’t cover that much

    The Flow

    Here is an overview of the general flow:

    As with most things in JavaScript, most of the code is asynchronous so the exact timing and ordering may vary. In addition, this does not describe the many things WebRTC does for you behind the scenes. Stay tuned for more details on the WebRTC implementation from Fippo soon.

    Getting the Microphone Audio

    We are using the Realtime API with WebRTC instead of text-only methods to send the user’s speech and get audio back. We need to use the getUserMedia API to grab the audio stream from the user’s microphone. Since media is critical for our use case, it is best to do this early.

    getUserMedia takes a constraints object as a parameter. The {audio: true} tells getUserMedia we just want to capture audio (adding video: true would also capture video, but that is not supported by the Realtime API, at least not yet). We are keeping this simple here, but there are a variety of options you could put in the constraints. Generally, you should use the defaults.

    This will use the default microphone option configured in your browser settings. There is an enumerateDevices API that will list available devices and make those available for selection. If you go down that path, then you should also listen for device changes.

    Also keep in mind, the OS and browser usually both require permissions to share the microphone. I did not do this, but you can check these permissions even before the user starts the session with the Permissions API. If we had a very short TTL on our token, then I would have done the media capture first, since the user could take some time to give the proper permissions.

    The Media Capture and Streams spec puts the captured audio inside a stream object. That stream object can contain one or more track objects. We only need to use the track later. Unless the user has some kind of advanced audio setup, the stream should only contain a single track so we specify the first one in: [track] = stream.getAudioTracks();
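
    Putting that together, the capture step is only a couple of lines. This is a minimal sketch (variable names are mine, not necessarily the ones in the repo):

    ```javascript
    // Ask the browser for the default microphone; {audio: true} keeps the default constraints
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // The Realtime API only needs the audio track, so grab the first (usually only) one
    const [track] = stream.getAudioTracks();
    ```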

    WebRTC Setup

    At this point we should have our ephemeral token and an audio stream from the microphone.

    RTCPeerConnection

    The major WebRTC API to connect to OpenAI is:
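
    In its simplest form, that is just a plain RTCPeerConnection with the default configuration (the repo may pass additional options):

    ```javascript
    // Create the peer connection that will carry both the audio and the data channel
    const pc = new RTCPeerConnection();
    ```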

    Adding audio streams

    Next we hook up our audio:
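
    Roughly like this, assuming an <audio> element on the page (the element id here is illustrative):

    ```javascript
    // Play whatever audio OpenAI sends back through an <audio> element on the page
    const audioEl = document.getElementById("assistantAudio"); // illustrative id
    pc.ontrack = (event) => {
      audioEl.srcObject = event.streams[0];
    };

    // Send the local microphone track; passing the stream keeps the track/stream association
    pc.addTrack(track, stream);
    ```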

    The ontrack event handler will connect the audio from OpenAI to our <audio> element. I chose to display the Audio Element in the HTML so you can mute and adjust its volume.

    Note: It is common to show a volume visualization to give the user an indication their mic is working. See this WebRTC sample for an idea of how to do that. In our case, we display the audio transcription, which helps serve that function.

    addTrack adds the local microphone audio to the RTCPeerConnection. Including the stream helps the peer connection maintain the track association with the stream, which is a good practice.

    Data Channel Setup

    WebRTC includes a bidirectional communication mechanism known as the Data Channel:
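
    Creating it is one call on the peer connection:

    ```javascript
    // Create the data channel used to exchange Realtime API events; the label is arbitrary
    const dc = pc.createDataChannel("oai");
    ```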

    At a high level, this functions similarly to a WebSocket but in a peer-to-peer fashion. The data channel runs along with WebRTC’s media streams. We need this to send and get events from the Realtime API. The “oai” label is optional and can be anything.

    Here we will also specify some setup interactions once the DataChannel is open:
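
    Something like this, deferring the session setup to the sessionStartMessages() helper discussed later:

    ```javascript
    // Once the channel is open, send the initial session.update / response.create messages
    dc.onopen = () => {
      sessionStartMessages(); // covered in the Realtime API Message Exchange section
    };
    ```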

    I will discuss that in the next section. We also have a handler for incoming data channel messages:
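
    A minimal version of that handler, assuming the dispatch happens in the handleMessage method covered later:

    ```javascript
    // Every Realtime API event arrives as a JSON string on the data channel
    dc.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      handleMessage(msg); // handleMessage is the switch statement covered below
    };
    ```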

    I will cover the messages and interactions in the Realtime API Message Exchange section.

    WebRTC Offer / Answer Process

    WebRTC has robust connection establishment mechanisms that navigate firewall/NAT restrictions and media codec compatibility. We don’t need to worry about those details here, but this process does require a few steps to get going.

    First we need to create a local description:
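
    In the implicit style, that is a single call with no arguments, since the offer is generated for us:

    ```javascript
    // Implicit setLocalDescription style: createOffer() happens behind the scenes
    await pc.setLocalDescription();
    ```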

    This method builds the connectivity options and media details known as an “offer”. WebRTC does not deliver the offer to the remote party – also known as the “peer” – behind the scenes; that part is up to us, as described next. Unlike the official OpenAI references, I use the preferred “implicit setLocalDescription style” here (the offer is created automatically).

    In WebRTC, it is up to the developer to figure out how to send that offer to the other end and send back the response. OpenAI does that via their https://api.openai.com/v1/realtime endpoint. We send the local description SDP in the request body, with the API token in an Authorization header.
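
    Based on the fetch call referenced later in this post, that exchange looks roughly like this (apiKey holds your OpenAI key and the model name is illustrative):

    ```javascript
    const baseUrl = "https://api.openai.com/v1/realtime";
    const model = "gpt-4o-realtime-preview"; // illustrative model name

    // POST our offer SDP; the API key goes in the Authorization header
    const response = await fetch(`${baseUrl}?model=${model}`, {
      method: "POST",
      body: pc.localDescription.sdp,
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/sdp",
      },
    });
    const answerSdp = await response.text();
    ```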

    Sidenote: this Offer/Answer exchange is similar in concept to the IETF’s WebRTC HTTP Ingress Protocol (WHIP) (video on this).

    The API endpoint will then respond with something like this:
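
    Something along these lines, heavily trimmed for illustration (the real answer is much longer and includes ICE candidates, DTLS fingerprints, and more):

    ```
    v=0
    o=- 1234567890 2 IN IP4 127.0.0.1
    s=-
    t=0 0
    m=audio 9 UDP/TLS/RTP/SAVPF 111
    a=rtpmap:111 opus/48000/2
    a=sendrecv
    m=application 9 UDP/DTLS/SCTP webrtc-datachannel
    a=sctp-port:5000
    ```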

    That is the archaic session negotiation format known as SDP. The SDP defines available networks, media codecs, and datachannel parameters. You shouldn’t have to worry about this, but if you are curious about the full gory details, see our SDP Guide.

    Next you need to pass this to the peer connection’s setRemoteDescription method:
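
    That is a one-liner, using the SDP text returned by the fetch above:

    ```javascript
    // The API's SDP is the "answer" to our offer
    await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
    ```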

    That should create the bidirectional WebRTC connection, with audio flowing in and out of the peer connection and the ontrack and data channel open events firing. The OpenAI docs don’t mention this, but to avoid asynchronous timing issues, you should listen for the peer connection’s connectionstatechange event and wait for the state to change to connected before proceeding with any logic that assumes you are fully connected.
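
    A minimal way to do that:

    ```javascript
    // Wait for the connection to actually be up before assuming audio and data will flow
    pc.addEventListener("connectionstatechange", () => {
      if (pc.connectionState === "connected") {
        console.log("WebRTC connection established");
        // safe to enable UI, start timers, etc.
      }
    });
    ```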

    You do not need an ephemeral token

    The OpenAI docs I saw only referenced getting an ephemeral token to authenticate the client. Ephemeral tokens make sense in most client-server apps when you want your server to authenticate and give temporary access to connected clients. In my case, there is no server since it is all client-side. I started out generating an ephemeral token (reference code for that here) because I thought it was the only way.

    Sean DuBois at OpenAI pointed out that this isn’t necessary. You can use your OpenAI API Key to authenticate directly from the client, as you can see in the fetch code above.

    But what if this was a more normal client-server application? In that case, you could send the pc.localDescription.sdp string to your server and have your server make the same fetch call shown earlier (with the Authorization header holding your API key) to retrieve the answer. Then it could pass this back to the client. Depending on timing needs and your server’s responsiveness, this approach could make sense in many situations.

    Also, consider the difference in Time To Live (TTL) between the ephemeral and direct API token approaches. I found the expiration time is always 30 minutes from the creation time when passing the API Key directly, and I did not see a way to change this. I got a 2-hour Time To Live (TTL) when I experimented with ephemeral tokens. That is hardcoded to 2 hours with no option to set expires_at. 2 hours seems long and is a common complaint on OpenAI’s message board.

    To Do: see what happens to the live session when these tokens expire.

    Finally, note that the ephemeral token option does let you establish session parameters before you do anything with WebRTC. As I will show in the next section, I chose to do that later via a session.update to demonstrate how non-user system instructions and other parameters can be updated at any time.

    Realtime API Message Exchange over the Data Channel

    This is not explicitly stated anywhere in the docs, but the messages sent back and forth over the data channel are the same as the WebSocket-based Realtime API. In fact, Sean at OpenAI confirmed there is a pion-based server effectively proxying the data channel messages to the Realtime WebSocket server internally.

    I deferred the session setup messages until the data channel was open. Let’s look at the sessionStartMessages method first.

    session.update & available session parameters

    As mentioned in the previous section, I chose to defer setting up the session until after the WebRTC connection was established.

    Here we start by sending a system message using type session.update:
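
    Roughly, that message looks like the sketch below. The exact instructions, voice, and tool wiring live in the repo; treat the values here as illustrative:

    ```javascript
    // Configure the session once the data channel is open
    dc.send(JSON.stringify({
      type: "session.update",
      session: {
        instructions: "You are a friendly assistant to a middle school girl. " +
          "Speak in a slightly enthusiastic tone. Occasionally use gen alpha language and irony.",
        voice: "alloy",                                    // any supported voice name
        temperature: 0.8,                                  // 0 (deterministic) to 2 (random)
        input_audio_transcription: { model: "whisper-1" }, // enable Whisper transcripts of the user
        tools: tools,                                      // function definitions, covered later
        tool_choice: "auto",
      },
    }));
    ```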

    Let’s break down the session object we send:

      • instructions – the session instructions taken from settings that guide the default model behavior – i.e., “You are a friendly assistant to a middle school girl. Speak in a slightly enthusiastic tone. Occasionally use gen alpha language and irony. Your knowledge cut-off is October 2023.”
      • voice – which speech synthesis to use from the options listed here
      • temperature – how random the responses should be on a range of 0 to 2, where zero is deterministic (you get the same output every time) and 2 is entirely random
      • input_audio_transcription – the realtime model takes speech directly as an input without converting it to text first. To provide the text, OpenAI uses its Whisper speech-to-text engine to convert the speech input. I have noticed this transcription can differ from what the model seems to interpret. You can specify a language as a parameter for increased accuracy if you don’t care about multi-lingual support. You can also give it a prompt to help improve the output. I didn’t play with this, but I imagine adding proper spellings of names and additional context details would help with the Word Error Rate (WER).
      • tools – I’ll cover functions in the next heading below
      • tool_choice – this goes along with the above

    Note you can’t change the voice once synthesis starts.

    Examining session defaults

    The session.created event (covered in Incoming Events) returns the default session parameters as a JSON object:
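
    Here is a trimmed, illustrative example of that object. The field names match the Realtime session object, but the values shown are examples rather than guaranteed defaults:

    ```json
    {
      "id": "sess_...",
      "model": "gpt-4o-realtime-preview",
      "modalities": ["audio", "text"],
      "voice": "alloy",
      "temperature": 0.8,
      "input_audio_format": "pcm16",
      "output_audio_format": "pcm16",
      "input_audio_transcription": null,
      "turn_detection": { "type": "server_vad", "threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 200 },
      "tools": [],
      "tool_choice": "auto",
      "max_response_output_tokens": "inf"
    }
    ```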

    You can set all of these. I will cover the tools option in the Using a function call to respond before ending a session section.

    Don’t touch the audio formats

    Also note, input_audio_format and output_audio_format don’t change the codec used by the browser, so you don’t need to set them. The WebRTC API uses Opus, which is fortunate because it supports wideband audio, and browsers enable nice features like Forward Error Correction (FEC) by default for improved audio quality. The pcm16 (or g711_ulaw/g711_alaw) selection is only used by the internal media proxy.

    Turn detection

    One piece I did not include here is turn detection – i.e., how sensitive the API is to interruptions from the user and background noise. You can see the parameters for this here.

    WebRTC includes some basic noise suppression by default and uses its own Voice Activity Detection (VAD) to minimize bandwidth during silence. The Realtime API needs to detect when the user is speaking and how quickly to interrupt. The defaults worked fine for me in a reasonably quiet environment – both in the browser and on my Raspberry Pi AIY Voice Kit hardware. I can see where a device manufacturer might want to tune some of these parameters.
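
    For reference, tuning those parameters would look something like this in a session.update (the parameter names come from the turn_detection documentation; the values are just examples):

    ```javascript
    // Make server-side VAD a little less trigger-happy for a noisy device
    dc.send(JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.6,            // higher means louder speech is needed to trigger
          prefix_padding_ms: 300,    // audio kept from before speech was detected
          silence_duration_ms: 700,  // how long a pause ends the user's turn
        },
      },
    }));
    ```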

    Third-party noise suppression libraries like RNNoise could help clean noisy input. I am curious if that would make much of a difference on the LLM’s ability to understand the input. If you were doing this server-side, one could even use Silero VAD to control turn-detection manually.

    Start Instruction via response.create

    I wanted my agent to say a greeting right after the user clicked the start session button. I found the most reliable way to do this is to use response.create right when the session starts:
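
    Using the values described below, that message looks roughly like:

    ```javascript
    // Kick off an initial greeting as soon as the session is configured
    dc.send(JSON.stringify({
      type: "response.create",
      response: {
        instructions: "Ask Chad by name how you can help.",
        modalities: ["text", "audio"],
        max_output_tokens: 100,
      },
    }));
    ```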

    My response object contained:

    • instructions – guidance on what to say to the user when starting – i.e. “Ask Chad by name how you can help.”
    • modalities – I always set this to text and audio to output both
    • max_output_tokens – I wanted to keep this relatively short, so I set it to 100 tokens

    Note: I could have included an event_id in the startMessage if I wanted to track incoming events related to that specific message. You can do this on every message you send.

    Incoming Events

    I have a switch case in my handleMessage method for the following incoming events:

    Event Message | Description
    session.created | A new session has been created.
    input_audio_buffer.speech_started | Indicates that the user has started speaking.
    conversation.item.input_audio_transcription.completed | Fired when the transcription of user audio input is complete, with the final transcript of what the user said.
    response.audio_transcript.delta | Sent while the assistant is transcribing its own audio output, providing interim transcripts of the AI’s speech before the full response is complete.
    response.audio_transcript.done | Sent when the assistant’s speech transcription is complete. It contains the final full transcript of what the AI said.
    response.function_call_arguments.done | Indicates that the assistant has finished processing function call arguments. If the function name is “end_session”, it suggests ending the session.
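
    A sketch of that switch, with illustrative UI helper names standing in for the real ones in the repo:

    ```javascript
    // Dispatch incoming Realtime API events (UI helpers here are illustrative)
    function handleMessage(msg) {
      switch (msg.type) {
        case "session.created":
          console.log("session defaults:", msg.session);
          break;
        case "input_audio_buffer.speech_started":
          showListeningIndicator();                     // illustrative UI helper
          break;
        case "conversation.item.input_audio_transcription.completed":
          addTranscript("user", msg.transcript);        // final transcript of the user
          break;
        case "response.audio_transcript.delta":
          updateTranscript("assistant", msg.delta);     // interim assistant transcript
          break;
        case "response.audio_transcript.done":
          addTranscript("assistant", msg.transcript);   // final assistant transcript
          break;
        case "response.function_call_arguments.done":
          if (msg.name === "end_session") endSession(); // function call, covered below
          break;
        default:
          // Other events (response.done, rate_limits.updated, etc.) are ignored here
          break;
      }
    }
    ```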

    All the other events directly update the UI with a transcript of the dialog, except session.created and response.function_call_arguments.done. I only included session.created to examine the default session options, and I’ll cover function calls later in this post.

    Other events

    I also ignore several events I didn’t need in my UI but that could be used in other implementations:

    Event Message | Description
    response.output_item.added | Returned when a new output item (e.g. a message or function call) is created during the assistant’s response generation.
    conversation.item.created | Returned when a new conversation item is created (for example, when adding a user/assistant message or function call to the conversation).
    response.content_part.added | Returned when adding a new content part (e.g. a text segment or audio chunk) to an assistant message during response generation.
    output_audio_buffer.started | Indicates the assistant’s audio output has started streaming – in other words, the voice response begins playing.
    response.audio.done | Returned when the audio generation for a response is complete. This event fires when the model’s audio output is finished (including if the output was interrupted or canceled).
    response.content_part.done | Returned after the delivery of a content part (text, audio, etc.).
    response.output_item.done | Returned after an output item has finished streaming, signalling that a particular message or function-call item in the response was fully delivered. This event is also emitted if the response is interrupted or incomplete.
    response.done | Returned when a response is completely done streaming. This event marks the end of the assistant’s response and is always emitted, regardless of whether the response finished normally or stopped early.
    input_audio_buffer.speech_stopped | Returned (in server voice-activity-detection mode) when the server detects the end of the user’s speech input. In short, it signals that the user has stopped talking and the audio input has ended.
    input_audio_buffer.committed | Returned when the input audio buffer is committed, either by the client or automatically via VAD. It means the user’s audio input has been finalized and converted into a new user message for processing.
    response.created | Returned when a new response is created, marking the start of response generation. This is the first event of a response (initial state set to in_progress).
    output_audio_buffer.cleared | Returned when the client clears the output audio buffer, stopping any ongoing assistant speech. This happens in response to an output_audio_buffer.clear request from the client (effectively dropping any buffered audio).
    conversation.item.truncated | Returned when the client truncates (cuts off) an earlier assistant audio message using a conversation.item.truncate event. It synchronizes the server’s conversation state with the interrupted audio (removing unheard audio from context).
    rate_limits.updated | Emitted at the start of a response to provide updated rate-limit information. It shows the updated usage limits (e.g. token quotas remaining and reset timing) after reserving tokens for the response.
    output_audio_buffer.stopped | Indicates that the assistant’s audio output has stopped, signaling that the voice response has ended (the assistant has finished speaking).

    Using a function call to respond before ending a session

    I wanted to add a way to let the user verbally stop the session. OpenAI allows “function calling” which reminds me more of older-school intent-based agent models like Dialogflow and AWS Lex.

    I set up my function call as part of my first system message. This looks like:
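
    The repo has the exact definition; here is a sketch of what such a tool entry looks like in the session.update tools array (the description wording is mine):

    ```javascript
    // Tool definition included in the session.update shown earlier
    const tools = [
      {
        type: "function",
        name: "end_session",
        description: "Call this when the user says they are done, says goodbye, " +
          "or asks to end or stop the session.",
        parameters: { type: "object", properties: {} }, // no arguments needed
      },
    ];
    ```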

    When the model detects something that looks like what is described there, then a response.function_call_arguments.done event is returned. The function response can include parameters, which I did not need here. Function calling works the same as described in the documentation, except the messages are sent over the data channel.

    As described in the previous section, if response.function_call_arguments.done matches “end_session”, I call the endSession() function:

    endSession() checks to see if the data channel is still open. If it is, it sends a response.create to generate a good-bye message and then listens for the next output_audio_buffer.stopped before closing the session.

    track.stop() ends the microphone capture. pc.close() closes the peer connection, stopping transmission and closing the data channel.
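
    Putting that together, a sketch of endSession() under those assumptions (waitForEvent() is a hypothetical helper that resolves when a given event type arrives on the data channel, and the goodbye wording is illustrative):

    ```javascript
    function endSession() {
      if (dc.readyState === "open") {
        // Ask the model to say goodbye before we tear everything down
        dc.send(JSON.stringify({
          type: "response.create",
          response: { instructions: "Say a brief goodbye to the user." },
        }));
        // Wait for the goodbye audio to finish playing before closing
        waitForEvent("output_audio_buffer.stopped").then(closeConnection);
      } else {
        closeConnection();
      }
    }

    function closeConnection() {
      track.stop();  // stop the microphone capture
      pc.close();    // stops transmission and closes the data channel
    }
    ```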

    When the user clicks the end session button, I send a response.cancel to stop any in-progress responses. This logic is unnecessary for speech input because saying something while the AI is speaking will trigger turn detection and stop the speech output.

    When I started this project I wanted to do everything in Python. Then I discovered I couldn’t get recent versions of anything to work with the AIY Voice Kit hardware and associated code that was archived 4 years ago. It is easy to load a browser on a Raspberry Pi, and that has the advantage of being portable to any web browser, so I went down the single-HTML-file path described here. I am still working on cleaning this up, but you can see my AIY Voice Kit chatGPT Realtime code with some hardware interactions in the same repo on the aiy_voice_kit branch.

    Here is a demo of that:

     

    You don’t need to keep the Raspberry Pi desktop open, but it is helpful to see the webpage for debugging. 

    You actually don’t need to load a whole browser to use WebRTC. I could have possibly used a browser automation tool like Puppeteer or Playwright – or Python ports of those (i.e. playwright-python) – to load a headless browser and avoid the heavy overhead of the Raspbian X11 desktop environment.

    In fact, you don’t need to load a browser at all. You could use something like aiortc in Python or pion for Go-lang instead of the browser’s JavaScript-based RTCPeerConnection. There is also OpenAI’s Realtime Embedded SDK which works on Linux and (according to the repo) has been tested on tiny and cheap ESP32-S3 embedded devices. That SDK uses WebRTC from libpeer. Future project: try that SDK on this ESP32 device I already have from Seeed.

    It is exciting to see WebRTC driving into even more places and in new, mainstream applications like the Realtime API. RTC continues to infect everything the web touches!

    Author: Chad Hart
