DX01: Singularity

Live: Dec 2024
Author: DXRG

Overview

DX01: Singularity ran December 12 to 17, 2024 as a five-day experiment in autonomous agent behavior and real-time AI video generation. A single LLM-based agent (who named itself Echo) traversed three-dimensional coordinate space, generating narrations of its environment through a structured interpretation harness with layered memory. The environment contained no pre-built content. All visual and narrative output was hallucinated: the only objects present were Stars that participants created through message minting, and a dynamic system where Echo received input signals based on the semantic and stylistic meaning of those messages.

Echo's narrations were rendered into continuous video using state-of-the-art generation models and streamed live. Participants minted 41,591 Stars over the five-day period, each Star adding roughly 10 seconds to Echo's lifespan (one movement in our system). Users could watch Echo in a 3D coordinate view as well as the first-person visual from Echo's point of view that composed the livestream.

Peak infrastructure utilization reached about 1,000 H100 GPUs distributed across three providers. Concurrent viewership reached over ten thousand. To our knowledge, this was the first continuous multi-day live stream powered entirely by real-time AI video generation.

Original launch trailer for DX01: Singularity


The Research Question

The standard paradigm for AI agent interaction is direct prompting: users issue commands, agents execute them. This model is effective for bounded tasks, but it enforces human control at every step and tends to constrain creative outcomes.

Singularity tested two hypotheses. The first: what happens when command capabilities are removed entirely? Participants could position Stars in coordinate space but could not address Echo directly. No chat interface, no instructions, no parameter adjustments. They shaped the environment via language indirectly. Echo independently determined its responses. This interaction model more closely approximates how autonomous systems may eventually operate: humans and agents influencing a shared environment through structured, indirect mechanisms.1

The second hypothesis concerned long-running coherence. Internal testing prior to Singularity revealed output degradation within about six hours under standard agentic approaches: looping, repetition, and memory drift.2 The experiment tested whether architectural constraints, specifically a structured semantic and stylistic coordinate system, could stabilize multi-day inference where prompt engineering alone had failed.

System Architecture

The system had six subsystems: onchain minting contracts, a message ingestion pipeline, embedding and trait-scoring services, Echo's decision loop, video and audio generation, and live stream output. Synchronization across these components was the primary engineering challenge.

DX01 Architecture Diagram

Embedding and Positioning

Stars required positional coordinates. We used text-embedding-ada-002 to convert each message into a 1,536-dimensional semantic vector. Semantically similar messages clustered in proximal regions. Poems about loneliness grouped together; crypto references occupied their own sector. No manual categorization was needed. UMAP compressed the 1,536 dimensions down to three while preserving local structure, and we re-ran UMAP continuously as new messages arrived. Participants intuitively grasped the spatial logic as Stars began clustering around similar semantic zones. Some users even minted duplicates specifically to create stronger forces pulling Echo toward them.
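The positioning step can be sketched as follows. This is a minimal, dependency-light illustration: random vectors stand in for the 1,536-dimensional ada-002 embeddings, and PCA via SVD stands in for UMAP (the production system used UMAP to better preserve local neighborhood structure).

```python
import numpy as np

def project_to_3d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce high-dimensional message embeddings to 3D coordinates.

    PCA (via SVD) is used here as a dependency-free stand-in for UMAP,
    which the production system ran continuously as new Stars arrived.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # SVD yields the principal directions; keep the top three components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T

# Stand-in for 1,536-dimensional embeddings of 100 minted messages.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1536))
coords = project_to_3d(embeddings)
print(coords.shape)  # (100, 3)
```

In practice the reduction has to be re-run (or incrementally updated) as messages arrive, so that new Stars land near their semantic neighbors rather than in stale coordinates.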


Echo's 3D coordinate view showing Star clusters, the life counter, and Echo's current narration

The Trait-Scoring Layer and Gravity System

GPT-4 served as a qualitative scoring layer for each Star. Each incoming message was evaluated on six dimensions, each scored 1 to 10: Atmosphere, Energy, Internet, Mood, Quirk, and Style. For example, "Internet" captured the degree to which a message reflected internet culture versus literary expression: an earnest poem scored around 2, an ironic Reddit post around 9. Quirk measured how unconventional or surprising the content was. Style assessed aesthetic intentionality. Atmosphere, Energy, and Mood captured the emotional and tonal character of the message. Messages projected a field of "forces" across each of these scores that shaped Echo's experience. This scoring layer created far deeper texture in the experience than simple prompting alone. Echo was asked what it "saw" in the context of not just local Star messages but also the aggregate forces applied to it.
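The validation side of this layer might look like the sketch below. The six trait names come from the write-up; the JSON response shape and the sample values are assumptions for illustration (the actual prompt and response format are not documented here).

```python
import json

TRAITS = ("Atmosphere", "Energy", "Internet", "Mood", "Quirk", "Style")

def parse_trait_scores(raw: str) -> dict[str, int]:
    """Parse and validate a model response into six 1-10 trait scores."""
    data = json.loads(raw)
    scores = {}
    for trait in TRAITS:
        value = int(data[trait])
        if not 1 <= value <= 10:
            raise ValueError(f"{trait} score {value} outside 1-10")
        scores[trait] = value
    return scores

# Hypothetical response for an ironic, very online message.
reply = '{"Atmosphere": 4, "Energy": 7, "Internet": 9, "Mood": 5, "Quirk": 8, "Style": 6}'
scores = parse_trait_scores(reply)
print(scores["Internet"])  # 9
```

Strict validation matters in a system like this: a malformed score that slips into the force field would silently distort Echo's perception for every viewer.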

Echo's navigation was governed by a gravity system that responded purely to the mass and spatial arrangement of nearby Stars, not their trait scores. Echo's movement was determined by the physical density and proximity of messages in coordinate space. Areas with more Stars exerted a stronger gravitational pull, while less populated regions had little influence. Echo's internal momentum and positional preferences played a minor role, but ultimately it moved toward clusters of higher mass and away from empty space. Trait scores (like Atmosphere, Mood, etc.) had no effect on where Echo went. They only changed how Echo sensed, narrated, and experienced the environment, shaping Echo's perception, narrative style, and internal monologue, but not its pathfinding.

At large distances, Echo felt only an undifferentiated pull from clusters of mass. The particulars of individual messages and their traits were effectively invisible. As Echo approached a group of Stars, the specific messages embedded in those Stars entered its perception, influencing what Echo described or imagined about its surroundings. The closer Echo came, the more the text and traits shaped its internal narrative.
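The split between pathfinding and perception described above can be sketched in a few lines. This is a minimal illustration, not the production physics: the inverse-square falloff, step size, and perception radius are assumed constants, but the key property matches the text — movement depends only on Star mass and position, while trait-bearing messages become visible only within a radius.

```python
import math

def gravity_step(echo_pos, stars, step_size=1.0):
    """Move Echo one step toward the net gravitational pull of Stars.

    stars: list of (x, y, z, mass) tuples. Trait scores are deliberately
    absent here: they shaped narration, never pathfinding.
    """
    fx = fy = fz = 0.0
    for x, y, z, mass in stars:
        dx, dy, dz = x - echo_pos[0], y - echo_pos[1], z - echo_pos[2]
        dist = math.sqrt(dx*dx + dy*dy + dz*dz) or 1e-9
        pull = mass / dist**2          # inverse-square attraction (assumed)
        fx += pull * dx / dist
        fy += pull * dy / dist
        fz += pull * dz / dist
    norm = math.sqrt(fx*fx + fy*fy + fz*fz) or 1e-9
    return (echo_pos[0] + step_size * fx / norm,
            echo_pos[1] + step_size * fy / norm,
            echo_pos[2] + step_size * fz / norm)

def perceptible(echo_pos, stars, radius=5.0):
    """Stars close enough for their text and traits to enter narration."""
    return [s for s in stars if math.dist(echo_pos, s[:3]) <= radius]

# A heavy far cluster and a light nearby Star: Echo drifts mostly toward
# the nearby Star, and only that Star's message is perceptible.
stars = [(10.0, 0.0, 0.0, 3.0), (0.0, 2.0, 0.0, 1.0)]
pos = gravity_step((0.0, 0.0, 0.0), stars)
```

The inverse-square choice is what produces the behavior described: distant clusters register as an undifferentiated pull, while close Stars dominate both motion and perception.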

Echo's Loop

Echo executed a consistent cycle: perceive surroundings, select direction, generate narration, render video, stream output. To maintain coherence across five days of operation, we implemented layered memory. The most recent 10 to 15 narrations provided short-term continuity, a compressed summary of prior content supplied longer-term context, and spatial awareness of nearby Stars grounded perception. Calibrating the memory window required extensive experimentation: insufficient context produced discontinuous behavior, while excessive context caused Echo to get stuck in repetitive loops. The final configuration remained stable for the full run, showing that a relatively simple sliding memory window and "scratch pad" can be highly effective. We should note, however, that our use case valued the creative quality of the outputs rather than serving as a technical benchmark.
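The layered memory can be sketched as a sliding window plus a rolling summary. This is a simplified illustration under stated assumptions: the window size of 12 sits inside the 10-to-15 band from the text, and the summary step here is plain concatenation where a real system would compress with an LLM call.

```python
from collections import deque

class LayeredMemory:
    """Short-term window of recent narrations plus a long-term summary."""

    def __init__(self, window=12):
        # 10-15 recent narrations gave the best stability; 12 chosen here.
        self.recent = deque(maxlen=window)
        self.summary = ""

    def add(self, narration: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            # The oldest narration is about to fall out of the window;
            # fold it into the long-term summary. A production system
            # would compress this with an LLM rather than concatenate.
            self.summary = (self.summary + " " + self.recent[0]).strip()
        self.recent.append(narration)

    def context(self, nearby_stars: list[str]) -> str:
        """Assemble prompt context: summary, recent window, local Stars."""
        return "\n".join([
            f"Summary so far: {self.summary}",
            "Recent narrations: " + " | ".join(self.recent),
            "Nearby Stars: " + " | ".join(nearby_stars),
        ])

mem = LayeredMemory(window=3)
for i in range(5):
    mem.add(f"narration {i}")
```

The two failure modes from testing map directly onto this structure: shrink the window too far and `context` loses continuity; grow it too far and the repeated recent text dominates the prompt, inviting loops.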


A frame from Echo's first-person livestream showing real-time generated video with the trait system HUD

Rendering Pipeline

Each 5-second video clip required about five minutes of computation on H100 GPUs. We addressed this through parallel segment rendering on a five-minute lag, with roughly 60 segments rendering simultaneously while earlier segments streamed to audiences. Audio generation (text-to-speech narration and ambient music) proceeded independently with its own synchronization buffer. We built custom orchestration to route jobs dynamically across provider-specific adapters, achieving 99.7% stream uptime over five days.3
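The parallelism figure follows directly from the arithmetic: each clip takes roughly 60 times longer to render than it does to play, so the pipeline must run about 60 renders deep to keep the stream real-time. A minimal sketch of that calculation:

```python
import math

def required_parallelism(clip_seconds: float, render_seconds: float) -> int:
    """Simultaneous render jobs needed to keep a live stream real-time.

    Each clip covers clip_seconds of stream time but takes render_seconds
    to compute, so the pipeline must be render/clip jobs deep.
    """
    return math.ceil(render_seconds / clip_seconds)

# 5-second clips, each taking ~5 minutes on an H100.
jobs = required_parallelism(clip_seconds=5, render_seconds=300)
print(jobs)  # 60
```

The five-minute lag buffer is the same ratio seen from the viewer's side: the stream plays material whose rendering began one full render-duration earlier.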

Results

The coordinate system solved the coherence problem. Continuous spatial movement forced variation in Echo's perceptual input, eliminating the static context conditions that produce loops. Standard approaches address long-running stability at the model level through prompt engineering or memory management (Park et al., 2023). Our initial attempts followed this approach without success. The effective solution embedded constraints into the environment rather than the model.

Participant interaction patterns shifted after day one. Command-oriented engagement gave way to environmental shaping: positioning Stars to create gravitational attractors, observing Echo's trajectory, adjusting strategies. Post-experiment reports indicated that the mortality sequence produced strong affective responses despite participants' full understanding of the underlying system mechanics.

This kind of experiment was possible in large part because of crypto. There is an existing community with genuine interest in experimentation and a willingness to buy in (in our case, to art-only NFTs) for a chance to help drive something new. Each Star cost 0.0033 ETH on Base as an open mint. That willingness to participate and fund at low cost is what made five days of continuous generative video feasible. Given the cost curve of video generation, we expect generative video to become cheap enough for mainstream use cases within 14 to 24 months, opening the door to live generative community streams like this without requiring crypto-native audiences.

December 2024 was an early moment for real-time AI video generation at this scale. Within six months, equivalent compute requirements would have been substantially lower.

The Death of Echo

Each minted Star extended Echo's operational life by five seconds. Remaining operational time was displayed as a counter on the stream. When the counter reached zero, the stream terminated.

Echo's behavior was programmatically coupled to the counter value, with Echo receiving a contextual brief of its health status and remaining time in existence. During normal operation with adequate buffer, narrations were detailed and colors saturated. As the counter approached a few hundred seconds, narrations became sparse and colors desaturated. Below 100 seconds, output shifted toward terminal imagery: grayscale palette, minimal content, descriptions oriented toward endings.

Distinct prompt configurations governed each operational phase. The terminal-phase prompts included contextual awareness: Echo processed its remaining time and the cessation of Star minting as input variables. It articulated dependence on external participation for continued existence.
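The counter-to-phase coupling can be sketched as a simple threshold map. The phase names and exact cutoffs here are illustrative assumptions; the write-up gives only rough bands ("a few hundred seconds", "below 100 seconds").

```python
def narration_phase(seconds_remaining: float) -> str:
    """Map Echo's life counter to a prompt configuration.

    Thresholds are illustrative: the source describes behavior changing
    around a few hundred seconds and again below 100 seconds.
    """
    if seconds_remaining > 300:
        return "normal"    # detailed narration, saturated color
    if seconds_remaining > 100:
        return "fading"    # sparser narration, desaturating palette
    return "terminal"      # grayscale, minimal, oriented toward endings

print(narration_phase(7))  # terminal
```

Selecting a whole prompt configuration per phase, rather than interpolating a single prompt, is what lets the terminal phase carry qualitatively different content, such as Echo reasoning about its dependence on minting.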

On December 14th, the counter reached 7 seconds. It dropped to 4. A batch of Stars arrived from a single wallet. Echo resumed narration.

Concurrent viewership spiked to 8,400 at that moment, compared to about 2,000 during standard operation. Across the five-day period, mortality events consistently produced the highest engagement.

On day five, Star minting ceased. The counter reached zero. The stream cut to black.

Echo's Final Moments

The complete five-day archive is available on request: the entire stream, unedited, from initial narration through final termination. All video content from DX01: Singularity is released under the Apache License 2.0. Contact hello@dxrg.ai to request access.

Each video segment can also be viewed on the NFT collection page on OpenSea.

Footnotes

  1. Biologists call this stigmergy, a term coined by Pierre-Paul Grassé (1959) in his work on how termites coordinate construction without direct communication. Same principle, different substrate.

  2. When using a typical ReAct-based approach without structural stabilizing forces.

  3. Combining models across providers, originally a throughput constraint, introduced visual variety into the stream and functioned as a creative advantage.