개인용
SAM2 모델 구조 확인
JunHan
2024. 8. 4. 21:42
SAM2 Model Structure
--------------------
Input Image
|
v
+-------------------+
| Image Encoder |
| (Hiera Backbone) |
+-------------------+
|
v
+-------------------+
| FPN Neck |
+-------------------+
|
v
+-------------------------------+
| Memory Attention |
| |
| +---------------------------+ |
| | Self-Attention (RoPE) | |
| +---------------------------+ |
| | |
| +---------------------------+ |
| | Cross-Attention (RoPE) | |
| +---------------------------+ |
+-------------------------------+
|
v
+-------------------------------+
| Memory Encoder |
| |
| +---------------------------+ |
| | Mask Downsampler | |
| +---------------------------+ |
| | |
| +---------------------------+ |
| | Fuser (CXBlock) | |
| +---------------------------+ |
+-------------------------------+
|
v
+-------------------+
| SAM Decoder |
+-------------------+
|
v
Segmentation Mask
Data flow:
1. Input image → Image Encoder → FPN Neck
2. Features from FPN Neck → Memory Attention
3. Previous masks → Memory Encoder
4. Outputs from Memory Attention and Memory Encoder → SAM Decoder
5. SAM Decoder → Final segmentation mask
SAM2Base
├── Image Encoder
│ ├── Scalp
│ ├── Trunk: Hiera
│ │ └── Embed Dim: 112, Num Heads: 2
│ ├── Neck: FpnNeck
│ ├── Position Encoding: PositionEmbeddingSine
│ ├── D Model: 256
│ ├── Backbone Channel List: [896, 448, 224, 112]
│ └── Fpn Top Down Levels: [2, 3]
├── Memory Attention
│ ├── D Model: 256
│ ├── Pos Enc At Input: True
│ ├── Layer: MemoryAttentionLayer
│ ├── Activation: ReLU
│ ├── Dim Feedforward: 2048
│ ├── Dropout: 0.1
│ ├── Self Attention: RoPEAttention
│ │ ├── Rope Theta: 10000.0
│ │ ├── Feat Sizes: [32, 32]
│ │ ├── Embedding Dim: 256
│ │ ├── Num Heads: 1
│ │ ├── Downsample Rate: 1
│ │ └── Dropout: 0.1
│ └── Cross Attention: RoPEAttention
│ ├── Rope Theta: 10000.0
│ ├── Feat Sizes: [32, 32]
│ ├── Rope K Repeat: True
│ ├── Embedding Dim: 256
│ ├── Num Heads: 1
│ ├── Downsample Rate: 1
│ ├── Dropout: 0.1
│ └── Kv In Dim: 64
├── Memory Encoder
│ ├── Out Dim: 64
│ ├── Position Encoding: PositionEmbeddingSine
│ ├── Mask Downsampler: MaskDownSampler
│ └── Fuser: Fuser
│ └── Layer: CXBlock
│ ├── Dim: 256
│ ├── Kernel Size: 7
│ ├── Padding: 3
│ ├── Layer Scale Init Value: 1e-6
│ └── Use Dwconv: True