Stream of Vision :: 001

The universe of sound requires a universal audio interface.

All space is virtually compressed, audibly, into your awareness. You interact with all of it naturally with your voice, as if everything were within earshot: AI assistants, applications, podcasts, text to speech and speech to text, voice chat (asynchronous), phone (synchronous), music, books, documents, other languages, the ambient sounds around you. Everything. Everything is audible.

There is no visual interface in the typical mobile use case. There could of course be one or many, and it could be accessed in as many ways as are imaginable: a touchable projected GUI, a physical connection to another visual device, a wireless cast to a receiving screen, or network access via another device entirely. But more and more, that interface mode will become redundant, clunky, and slow to interact with when mobile. Computing will become more anthropomorphic. We will speak to everything, and everything will communicate back to us audibly.
This is fundamental.

Imagine yourself interacting with someone in real time. If both of you were blind, you could still talk to one another, and information would be conveyed quite readily. If, on the other hand, neither of you could hear or speak, then to communicate you would have to gesture in some kind of sign language you both understood, or pantomime, or draw pictures, or write things down and let each other read them and reply. Communicating would completely consume both parties’ attention.

Hearing and speech are incredibly effective at communication. Arguably more effective than visual modes, especially when location and position become variables. This is not to disparage visual and gestural interfaces; it’s simply a ranking of their efficacy in mobile and communication scenarios. Directional awareness, and indeed directional control, is the piece that has always been missing from our artificial audio reproduction. That is one of the reasons audio interfaces have had a slow uptake. They simply were not good. Many know this first hand, having been on a conference call between conference rooms. Everyone in each room sits staring and talking at the speaker box at the center of their table as if IT were the entity they were communicating with. Everyone in the other room is crammed into that spot, and the box on your table is all of them combined. The missing positional data is what prevented immersion and understanding. Once audio immersion is believably reproduced, effective conference rooms are virtual and participants are anywhere they happen to be. Yet the feeling of togetherness during the conference is real. Thus the information transfer is much more complete and effective.

The audio interface is the logical evolutionary advance. It will inevitably be the primary interface. We didn’t have good enough voice control before. We first had to carry around boxes with physical buttons, then touch screens, so we could tell the machine what we wanted it to do by looking and touching and writing, and the screens had to be big enough to use. Then, because we had these monitors in our pockets, video became the important content medium. Still we were not communicating effectively. These glass bricks did produce sound… Well, kind of. They used external components to do that, and rarely well. Sound has always been an afterthought in computer interaction.

It’s a function of the way technology unfolded that our computing interfaces are the way they are. Our entire computing interface paradigm was dictated by our limited human/machine audio interaction capability, and our mobile form factors reflect that lack of capability, carried over from the dominant personal computing paradigm. We got here through the screen, but I’m here to say the screen is through. For mobile, technological evolution is through the ear. The audio interface is hands and eyes free. It takes no more effort to ask for and receive information from an AI than it does from a human. You will communicate, listen, record, and interact to direct and control your extended augmented self with your voice and your ears. It’s a unique modular augmented self you will create, and a metaself we can all become together, as the AI taught from our mutual interactions informs, infuses, and enhances us all. We become the greater self. The sum of all knowledge.

You audibly interact and coexist with other sound sources at each of their audible foci. An audible focus is a scaled and/or modified locational sound point. The scaling method would be programmable; the location is zoomable and can have any kind of acoustics desired; effects can be static, dynamic, or positional.

Say I’m in Philly, talking to Steve in Northern CA. When I face North, Steve sounds like he is away in the direction from whence his voice cometh, as in where on the Earth (or possibly soon, where in the Solar System), or more simply, to my left. In reality, he’s at a specific point in 3-space, but how that presents itself to me is programmable. For instance, if the connection from one point to another were a straight-line vector, rather than a surface path on the planetary sphere, where Steve appeared to be talking from would be very different.
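
To make the Philly and Northern CA example concrete, here is a minimal Python sketch of the two presentations, under loudly stated assumptions: the Earth is treated as a perfect sphere, Steve is placed at San Francisco as a stand-in for "Northern CA", and the coordinates and function names (`initial_bearing`, `relative_angle`, `chord_depression`) are mine for illustration, not part of any existing system. The surface path gives a compass bearing along the horizon. The straight-line vector cuts through the planet, so on a sphere it points the same compass direction but dips below the horizon by half the angle between the two points.

```python
import math

# Assumed, approximate coordinates (degrees). San Francisco is only a
# stand-in for "Northern CA"; neither point comes from the original text.
PHILLY = (39.95, -75.17)
STEVE = (37.77, -122.42)

def initial_bearing(lat1, lon1, lat2, lon2):
    """Great-circle initial bearing, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0

def relative_angle(bearing, heading):
    """Source angle relative to the listener's nose, in [-180, 180).

    Negative means to the left, positive to the right, and a magnitude
    above 90 means behind the listener.
    """
    return (bearing - heading + 180.0) % 360.0 - 180.0

def chord_depression(lat1, lon1, lat2, lon2):
    """Degrees below the local horizon of the straight chord through a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    # Haversine formula for the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    central = 2 * math.asin(math.sqrt(a))
    # The chord leaves the surface at half the central angle below horizontal.
    return math.degrees(central / 2)

bearing = initial_bearing(*PHILLY, *STEVE)
print("surface-path bearing:", round(bearing, 1))
print("facing north, Steve is at:", round(relative_angle(bearing, 0.0), 1))
print("straight-vector dip below horizon:", round(chord_depression(*PHILLY, *STEVE), 1))
```

With these assumed coordinates the bearing comes out west-northwest, roughly 280 degrees, which is to the left of a listener facing North. In vector mode the same voice would arrive from the same compass direction but from somewhere under the floor, roughly 18 degrees below the horizon.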

In dynamic mode, if I turned to face East, Steve would sound like he was behind me now. In positional mode, Steve would be located in my virtual soundspace wherever I wanted him to be, or his position could be prescribed by the gathering or just its initiator. For example, a dinner table, where each member might be physically located in a different country, but all have fixed seating in the virtual soundscape. Or it could be static, i.e. non-positional, as in the focus of your soundscape, or literally, the center of your audio universe: directly between your ears.
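
The three modes can be sketched as a single lookup that answers "at what angle should this voice be rendered?". This is a hypothetical illustration, not a description of an existing API: the names `Mode`, `wrap_degrees`, and `perceived_azimuth` are invented here, angles are degrees clockwise from the listener's nose, and I am assuming that a positional "seat" is expressed relative to the listener and ignores geographic heading, which the text above does not specify.

```python
from enum import Enum
from typing import Optional

class Mode(Enum):
    DYNAMIC = "dynamic"        # follows real-world direction and head turns
    POSITIONAL = "positional"  # fixed seat in a virtual soundscape
    STATIC = "static"          # non-positional, centered between the ears

def wrap_degrees(angle: float) -> float:
    """Wrap any angle into [-180, 180): negative is left, positive is right."""
    return (angle + 180.0) % 360.0 - 180.0

def perceived_azimuth(mode: Mode,
                      world_bearing: float = 0.0,
                      heading: float = 0.0,
                      seat: float = 0.0) -> Optional[float]:
    """Return the rendering angle for a source, or None when it has no direction.

    world_bearing: compass bearing from listener to source (dynamic mode).
    heading:       compass direction the listener is facing (dynamic mode).
    seat:          assigned angle in the virtual soundscape (positional mode);
                   assumed here to be relative to the listener.
    """
    if mode is Mode.DYNAMIC:
        return wrap_degrees(world_bearing - heading)
    if mode is Mode.POSITIONAL:
        return wrap_degrees(seat)
    return None  # STATIC: render at the center of the head

# Steve lies roughly west-northwest of the listener (about 282 degrees).
print(perceived_azimuth(Mode.DYNAMIC, world_bearing=282.0, heading=0.0))   # facing North
print(perceived_azimuth(Mode.DYNAMIC, world_bearing=282.0, heading=90.0))  # facing East
print(perceived_azimuth(Mode.POSITIONAL, seat=45.0))                       # dinner-table seat
print(perceived_azimuth(Mode.STATIC))
```

Facing North, the dynamic result is negative (to the left); facing East, its magnitude exceeds 90 degrees (behind). The positional result never changes with heading, and static returns no angle at all.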

