SAM2 Model Structure
--------------------
           Input Image
                |
                v
      +-------------------+
      |   Image Encoder   |
      | (Hiera Backbone)  |
      +-------------------+
                |
                v
      +-------------------+
      |     FPN Neck      |
      +-------------------+
                |
                v
+-------------------------------+
|       Memory Attention        |
|                               |
| +---------------------------+ |
| |   Self-Attention (RoPE)   | |
| +---------------------------+ |
|               |               |
| +---------------------------+ |
| |  Cross-Attention (RoPE)   | |
| +---------------------------+ |
+-------------------------------+
                |
                v
+-------------------------------+
|        Memory Encoder         |
|                               |
| +---------------------------+ |
| |     Mask Downsampler      | |
| +---------------------------+ |
|               |               |
| +---------------------------+ |
| |      Fuser (CXBlock)      | |
| +---------------------------+ |
+-------------------------------+
                |
                v
      +-------------------+
      |    SAM Decoder    |
      +-------------------+
                |
                v
        Segmentation Mask
Data flow:
1. Input image → Image Encoder → FPN Neck
2. Features from FPN Neck → Memory Attention
3. Previous masks → Memory Encoder → memory features (Out Dim: 64)
4. Memory Attention cross-attends to those memory features (Kv In Dim: 64); the memory-conditioned features → SAM Decoder
5. SAM Decoder → Final segmentation mask
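
The loop implied by these steps can be traced with a small sketch. Every function below is a hypothetical stub, not the SAM 2 API, and tensors are represented only by their (C, H, W) shapes. It assumes that the Memory Encoder's output reaches the decoder through Memory Attention's cross-attention (the matching Out Dim: 64 and Kv In Dim: 64 in the config suggest this), and that Feat Sizes [32, 32] is the feature-map size.

```python
# Illustrative trace of the data flow; all names here are hypothetical stubs.
D_MODEL = 256        # "D Model" in the config
MEM_DIM = 64         # Memory Encoder "Out Dim" / cross-attention "Kv In Dim"
FEAT_HW = (32, 32)   # "Feat Sizes"; assumed to be the feature-map size


def image_encoder(frame_id):
    """Step 1: image -> Hiera trunk -> FPN neck -> D_MODEL-channel features."""
    return {"frame": frame_id, "shape": (D_MODEL, *FEAT_HW)}


def memory_attention(features, memory_bank):
    """Step 2: condition the current features on stored memories, if any."""
    for mem in memory_bank:
        # Keys/values enter cross-attention at the memory dimension.
        assert mem["shape"][0] == MEM_DIM
    return {**features, "conditioned_on": [m["frame"] for m in memory_bank]}


def sam_decoder(conditioned):
    """Steps 4-5: decode a mask from the memory-conditioned features."""
    return {"frame": conditioned["frame"], "kind": "mask"}


def memory_encoder(mask):
    """Step 3: encode a predicted mask into a MEM_DIM-channel memory."""
    return {"frame": mask["frame"], "shape": (MEM_DIM, *FEAT_HW)}


def run(num_frames):
    """Process frames in order, growing the memory bank after each one."""
    memory_bank, trace = [], []
    for t in range(num_frames):
        feats = image_encoder(t)
        cond = memory_attention(feats, memory_bank)
        mask = sam_decoder(cond)
        memory_bank.append(memory_encoder(mask))
        trace.append((t, cond["conditioned_on"]))
    return trace


trace = run(3)
print(trace)  # [(0, []), (1, [0]), (2, [0, 1])]
```

The first frame sees an empty memory bank, and each later frame is conditioned on the masks predicted before it. The stub keeps every past memory for simplicity.
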
SAM2Base
├── Image Encoder
│   ├── Scalp
│   ├── Trunk: Hiera
│   │   └── Embed Dim: 112, Num Heads: 2
│   ├── Neck: FpnNeck
│   ├── Position Encoding: PositionEmbeddingSine
│   ├── D Model: 256
│   ├── Backbone Channel List: [896, 448, 224, 112]
│   └── Fpn Top Down Levels: [2, 3]
├── Memory Attention
│   ├── D Model: 256
│   ├── Pos Enc At Input: True
│   ├── Layer: MemoryAttentionLayer
│   ├── Activation: ReLU
│   ├── Dim Feedforward: 2048
│   ├── Dropout: 0.1
│   ├── Self Attention: RoPEAttention
│   │   ├── Rope Theta: 10000.0
│   │   ├── Feat Sizes: [32, 32]
│   │   ├── Embedding Dim: 256
│   │   ├── Num Heads: 1
│   │   ├── Downsample Rate: 1
│   │   └── Dropout: 0.1
│   └── Cross Attention: RoPEAttention
│       ├── Rope Theta: 10000.0
│       ├── Feat Sizes: [32, 32]
│       ├── Rope K Repeat: True
│       ├── Embedding Dim: 256
│       ├── Num Heads: 1
│       ├── Downsample Rate: 1
│       ├── Dropout: 0.1
│       └── Kv In Dim: 64
├── Memory Encoder
│   ├── Out Dim: 64
│   ├── Position Encoding: PositionEmbeddingSine
│   ├── Mask Downsampler: MaskDownSampler
│   └── Fuser: Fuser
│       └── Layer: CXBlock
│           ├── Dim: 256
│           ├── Kernel Size: 7
│           ├── Padding: 3
│           ├── Layer Scale Init Value: 1e-6
│           └── Use Dwconv: True
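
Several of the numbers in the tree above are related to each other, and a short script can make those relationships explicit. This is only a sketch: the values are copied from the tree, but the dict layout, key names, and function names are hypothetical, and the padding check assumes a stride-1 convolution in CXBlock.

```python
# Values copied from the tree; the dict layout and key names are hypothetical.
CONFIG = {
    "image_encoder": {
        "embed_dim": 112,
        "d_model": 256,
        "backbone_channel_list": [896, 448, 224, 112],
    },
    "memory_attention": {
        "d_model": 256,
        "self_attention": {"embedding_dim": 256, "feat_sizes": [32, 32]},
        "cross_attention": {"embedding_dim": 256, "kv_in_dim": 64},
    },
    "memory_encoder": {
        "out_dim": 64,
        "cxblock": {"dim": 256, "kernel_size": 7, "padding": 3},
    },
}


def check_config(cfg):
    """Return named boolean checks over the listed hyperparameters."""
    enc = cfg["image_encoder"]
    att = cfg["memory_attention"]
    mem = cfg["memory_encoder"]
    cx = mem["cxblock"]
    embed = enc["embed_dim"]
    return {
        # 112 doubles at each stage; the list runs from deepest to shallowest.
        "channels_double_per_stage":
            enc["backbone_channel_list"] == [embed * 2 ** i for i in (3, 2, 1, 0)],
        # The neck output width matches the attention width.
        "encoder_and_attention_share_d_model": enc["d_model"] == att["d_model"],
        "attention_width_matches_d_model": all(
            att[name]["embedding_dim"] == att["d_model"]
            for name in ("self_attention", "cross_attention")
        ),
        # Memory features feed the cross-attention keys and values.
        "memory_dim_matches_kv_in_dim":
            mem["out_dim"] == att["cross_attention"]["kv_in_dim"],
        # With stride 1 (assumed), padding = kernel // 2 keeps the spatial size.
        "cxblock_padding_keeps_size": cx["padding"] == cx["kernel_size"] // 2,
    }


def tokens_per_frame(cfg):
    """Number of positions in one feature map of size Feat Sizes."""
    height, width = cfg["memory_attention"]["self_attention"]["feat_sizes"]
    return height * width


checks = check_config(CONFIG)
print(checks)
print(tokens_per_frame(CONFIG))  # 1024
```

All five checks pass for the values listed here. If one failed after editing the config, that would point to a likely dimension mismatch between modules.
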