
Artificial Intelligence

Robust Speech Recognition via Large-Scale Weak Supervision
https://arxiv.org/abs/2212.04356
Summary: We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard…
MotionBooth: Motion-Aware Customized Text-to-Video Generation
https://arxiv.org/abs/2406.17758
In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video m…
https://huggingface.co/papers/2406.1775..
What are Diffusion Models?
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)]. [Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen.] [Updated on 2022-08…
Revisiting Feature Prediction for Learning Visual Representations from Video
https://arxiv.org/abs/2404.08471
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, t…
Adding Conditional Control to Text-to-Image Diffusion Models
https://arxiv.org/abs/2302.05543
Abstract: We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pr…
DiT: Self-supervised Pre-training for Document Image Transformer
https://arxiv.org/abs/2203.02378
Summary: Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised…
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
https://arxiv.org/abs/2301.00808
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrat…
Reading papers while also working in a different field has left me a bit scattered…
Diffusion On Syntax Trees For Program Synthesis
https://arxiv.org/abs/2405.20519
Large language models generate code one token at a time. Their autoregressive generation process lacks the feedback of observing the program's output. Training LLMs to suggest edits directly can be challenging due to the scarcity of rich edit data. To addr…