
All Posts

3D Single-Stage Model Training (protected post)
VoiceLDM: Text-to-Speech with Environmental Context https://arxiv.org/abs/2309.13664 Abstract: This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt. The former provides information about the overall environmental context of…
DeepNet: Scaling Transformers to 1,000 Layers (appendix to be added) https://arxiv.org/abs/2203.00555 Abstract: In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived…
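A minimal sketch of the DeepNorm residual update the abstract refers to, x_{l+1} = LayerNorm(alpha * x_l + G_l(x_l)). This is an illustrative PyTorch stand-in, not the paper's code: the feed-forward sublayer is a placeholder, and the alpha/beta values assume the encoder-only rule I recall from the paper (alpha = (2N)^{1/4}, beta = (8N)^{-1/4} for an N-layer model), which should be checked against the full post.

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """DeepNorm-style residual block: LayerNorm(alpha * x + sublayer(x))."""
    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25        # assumed encoder-only setting
        beta = (8 * num_layers) ** -0.25             # init gain for residual branch
        self.sublayer = nn.Sequential(               # placeholder feed-forward sublayer
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        # DeepNorm also down-scales the residual-branch weights at initialization.
        for m in self.sublayer:
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight, gain=beta)
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

x = torch.randn(2, 16, 512)
block = DeepNormResidual(d_model=512, num_layers=100)
print(block(x).shape)  # torch.Size([2, 16, 512])
```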
Scaling Vision Transformers (appendix to be added) https://arxiv.org/abs/2106.04560 Abstract: Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model's scaling…
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency https://arxiv.org/abs/2409.02634 Abstract: With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited…
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) (appendix to be added) https://ar5iv.labs.arxiv.org/html/2010.11929 Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with…
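As the title suggests, ViT treats an image as a sequence of 16x16 patch tokens. Below is a minimal patch-embedding sketch assuming the standard ViT recipe (patchify, linear projection, [CLS] token, learned position embeddings); the class name and default shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each patch to a token.

    The projection is a Conv2d with kernel_size = stride = patch_size, which is
    equivalent to flattening each patch and applying a shared linear layer.
    """
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(images)                  # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 197, 768]): 196 patches + [CLS]
```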
EfficientNetV2: Smaller Models and Faster Training https://arxiv.org/abs/2104.00298 Abstract: This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture…
Diffusion Models Are Real-Time Game Engines (appendix to be added) https://arxiv.org/abs/2408.14837 Abstract: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames…