最近,我的磁盘空间不太够用了。
anduin@ms-server:~$ cd /swarm-vol/
anduin@ms-server:/swarm-vol$ df . -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme2n1 ext4 7.0T 6.1T 559G 92% /swarm-vol
anduin@ms-server:/swarm-vol$ cd /swarm-vol/nextcloud/
anduin@ms-server:/swarm-vol/nextcloud$ df . -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1 ext4 916G 554G 316G 64% /swarm-vol/nextcloud
anduin@ms-server:/swarm-vol/nextcloud$ sudo fdisk -l
Disk /dev/nvme1n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: INTEL SSDPED1D480GA
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 75C97A6C-09A4-4375-8260-7A950D36C1B4
Device Start End Sectors Size Type
/dev/nvme1n1p1 2048 1050623 1048576 512M EFI System
/dev/nvme1n1p2 1050624 937701375 936650752 446.6G Linux filesystem
Disk /dev/nvme2n1: 6.99 TiB, 7681501126656 bytes, 1875366486 sectors
Disk model: WUS4BB076D7P3E3
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/nvme0n1: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000P3PSSD8
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
anduin@ms-server:/swarm-vol/nextcloud$ cd /dev/disk/by-uuid/
anduin@ms-server:/dev/disk/by-uuid$ ls -ashl
total 0
0 drwxr-xr-x 2 root root 140 Jan 17 15:21 .
0 drwxr-xr-x 7 root root 140 Dec 28 05:45 ..
0 lrwxrwxrwx 1 root root 13 Jan 14 14:00 0377361e-2a7b-4024-a681-ea135c092cce -> ../../nvme0n1
0 lrwxrwxrwx 1 root root 13 Dec 28 05:45 49fd5e45-6074-4370-a95f-c4404920aff5 -> ../../nvme2n1
0 lrwxrwxrwx 1 root root 15 Dec 28 05:45 9C58-514E -> ../../nvme1n1p1
0 lrwxrwxrwx 1 root root 15 Dec 28 05:45 b91352af-9477-4684-8d08-2a45c39bec98 -> ../../nvme1n1p2
anduin@ms-server:/dev/disk/by-uuid$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
UUID=b91352af-9477-4684-8d08-2a45c39bec98 / ext4 errors=remount-ro 0 1
UUID=9C58-514E /boot/efi vfat umask=0077 0 1
/dev/disk/by-uuid/49fd5e45-6074-4370-a95f-c4404920aff5 /swarm-vol ext4 defaults,noatime,nofail 0 0
/dev/disk/by-uuid/0377361e-2a7b-4024-a681-ea135c092cce /swarm-vol/nextcloud ext4 defaults,noatime,nofail 0 0
/swapfile none swap sw 0 0
由上面的信息,不难判断出:
我的系统盘是 b91352af-9477-4684-8d08-2a45c39bec98,当然这和我们要调查的内容没什么关系。
我的数据都放在了 /swarm-vol 这个目录,它背后的磁盘是 49fd5e45-6074-4370-a95f-c4404920aff5(即 nvme2n1)。
即使我用了点奇技淫巧,把 /swarm-vol 下的子目录 nextcloud 暂时挪到了 0377361e-2a7b-4024-a681-ea135c092cce(nvme0n1)上,空间还是濒临耗尽。
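顺便,为了把 UUID、设备名和挂载点对应起来,也可以用下面的命令快速核对(仅作示意):
# 一次性查看 设备 / 文件系统 / UUID / 挂载点 的对应关系
lsblk -o NAME,FSTYPE,SIZE,UUID,MOUNTPOINT
# 或者只看某个挂载点背后的设备
findmnt /swarm-vol
findmnt /swarm-vol/nextcloud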
但是,幸运的是,我购买了一个全新的大而慢的机械硬盘:
Disk /dev/sda: 58.21 TiB, 64003468427264 bytes, 125006774272 sectors
Disk model: RAID5
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
为了测试它,我暂时挂载到了这里:
/dev/sda /mnt/temp_big ext4 defaults,noatime,nofail 0 0
接下来,我认为我需要开始设计我的迁移改造计划。
为了既能发挥我过去的 49fd5e45-6074-4370-a95f-c4404920aff5(也就是 nvme2n1,也就是 /swarm-vol 背后那块盘)快的固态特性,又能发挥 /dev/sda 大的优点,我计划这样设计:
使用 bcache,让 /dev/sda 作为真正的存储设备(后端),再让 49fd5e45-6074-4370-a95f-c4404920aff5 作为缓存盘,同时开启写缓存和读缓存,这样我就拥有又大又快的存储了。
考虑到我的缓存盘非常大(从上面的信息可以得出,它足足有 6.99 TiB 对吧?),我相信我可以设置非常激进的写缓存和读缓存。而且我的缓存盘非常可靠,几乎不会损坏,我也不担心短暂的数据丢失。我又不是银行,存的也都是电影之类的东西。
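按这个设计,迁移完成后的块设备层级大致会是下面这样(纯属示意,尺寸与名称以实机为准):
# 预期 lsblk 会显示 sda 与 nvme2n1 都挂在同一个 bcache0 之下(示意):
# sda          58.2T
# └─bcache0    58.2T  /swarm-vol   <- 数据真正落在 sda 上
# nvme2n1       7.0T
# └─bcache0    58.2T  /swarm-vol   <- nvme2n1 作为缓存附加在同一个 bcache 设备上
lsblk -o NAME,SIZE,MOUNTPOINT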
接下来,为了方便迁移,我开始设计我的迁移计划:
阶段概要
第一阶段 - 双数据阶段
将 sda 格式化清空,作为 bcache 的后端。此时 nvme2n1 继续承载业务数据,不移除它。然后使用 rsync 将业务数据拷贝到 sda 中。
第二阶段 - 暂停业务阶段
将业务暂停,然后我最后运行一次 rsync。这次 rsync 应该会跑得很快,因为只需同步增量差异。此时此刻,nvme2n1(ext4)的数据,和 sda(bcache 的后端)的数据就完全相同了。
第三阶段 - 重构存储阶段
将 nvme2n1 格式化。然后让它作为 bcache 的缓存端。再将得到的 bcache 虚拟盘,挂载到 /swarm-vol,实现业务无感。然后重启业务。
注意:我没有任何额外的新空间可以用于备份!所以我的命令必须一次成功!一旦失败我们将万劫不复!
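鉴于没有任何备份空间,我会在动手前先做一轮只读的预检,确认没有认错盘(以下仅作示意):
# 核对每块盘的型号和大小,确认 /dev/sda 确实是那块新的机械阵列
lsblk -o NAME,MODEL,SIZE,FSTYPE,MOUNTPOINT
# 之前测试时 sda 临时挂在 /mnt/temp_big,清盘前先卸载,并记得删掉 /etc/fstab 里那行临时挂载
sudo umount /mnt/temp_big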
第一阶段
接下来,我要开始第一阶段的迁移了。我第一阶段计划这么做:
目标
- 使用 make-bcache 将 /dev/sda 建立为 bcache 的后端(backing device)。
- 先不动现有 /dev/nvme2n1(现挂载于 /swarm-vol)上的业务数据,让业务继续运行。
- 在格式化出的 /dev/bcache0 上创建一个文件系统(例如 ext4),然后将现有数据从 /swarm-vol 同步到这个新地方。
- 这是“第一阶段”,意在让 /dev/sda 上也有一份业务数据拷贝,从而腾出后续的操作空间。
结果
- 最终会拥有两份数据:
- 原始:/swarm-vol(在 /dev/nvme2n1 上)
- 新的:/mnt/bcache(对应 /dev/bcache0,后端实际上是 /dev/sda)
- 业务不中断
我可以让服务继续使用 /swarm-vol,只要我在第一阶段只做数据拷贝、而不改动 /swarm-vol 自身。 在第一阶段结束后,等我准备好,可以进入“第二阶段”短暂停机做增量 rsync 以及最终切换。
# 安装 bcache-tools
sudo apt install bcache-tools
# 仅示例,注意操作前先确认 /dev/sda 确实空置
# (在 fdisk 交互式命令中,删除旧分区、新建分区)
sudo fdisk /dev/sda
# 使用 wipefs 清除 sda 上的所有签名
sudo wipefs -a /dev/sda
# 创建 bcache 后端
sudo make-bcache -B /dev/sda
# 如果在 fdisk 里没有找到 /dev/bcache0,可以尝试
# 重新加载内核模块:
sudo modprobe bcache
# 如果还是没有,尝试手工注册设备
# (注意:sudo echo ... > 的写法中,重定向不会被提权,这里改用 tee)
echo /dev/sda | sudo tee /sys/fs/bcache/register
# 确认后端创建成功
# UUID: d5a45ab0-60b2-4f3a-8cf1-4d4ca97c018c
# Set UUID: 01442457-240d-4bf4-8140-b7a647659beb
# version: 1
# block_size: 1
# data_offset: 16
# 确认 /dev/bcache0 已经出现
ls -ashl /dev/bcache0
# 格式化后端
sudo mkfs.ext4 /dev/bcache0
# 创建挂载点
sudo mkdir /mnt/bcache
# 挂载 bcache 后端
sudo mount /dev/bcache0 /mnt/bcache
# 确认挂载成功
cd /mnt/bcache
# 确认挂载成功
df . -Th
# (确认挂载成功后,开始 rsync)
sudo rsync -Aavx --update --delete /swarm-vol/ /mnt/bcache/
# 单独同步 nextcloud 目录(它是独立的挂载点,上面 rsync 的 -x 不会跨越挂载点)
sudo rsync -Aavx --update --delete /swarm-vol/nextcloud/ /mnt/bcache/nextcloud/
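第一次全量 rsync 之后,可以先粗略核对两边的数据量级,再用 --dry-run 看看剩余差异(示意;业务仍在写入,存在少量差异是正常的):
# 对比两边的占用空间,数量级应当接近
df -Th /swarm-vol /swarm-vol/nextcloud /mnt/bcache
# 用 --dry-run(-n)预演一次增量同步,只看会发生什么,不实际写入
sudo rsync -Aavxn --update --delete /swarm-vol/ /mnt/bcache/ | tail -n 20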
第二阶段 - 暂停业务并做最终同步
在这一阶段,我将:
- 暂停业务,使其不再写入 /swarm-vol(也就是旧的 nvme2n1)。
- 做最后一次增量 rsync,保证 /dev/bcache0(后端 sda)上的数据与旧数据完全一致。
- 卸载旧的 /swarm-vol,改为将 /dev/bcache0 挂载到 /swarm-vol,这样就完成了切换。
示例脚本(在生产环境中,请根据自己实际服务的暂停方式作相应调整):
# 1) 暂停业务
echo "停止相关业务/服务 (示例:docker-compose 或 systemctl stop 等)"
docker-compose down
sudo reboot # 重启服务器,确保业务不再写入
# 2) 做最后一次增量同步
sudo rsync -Aavx --update --delete /swarm-vol/ /mnt/bcache/
sudo rsync -Aavx --update --delete /swarm-vol/nextcloud/ /mnt/bcache/nextcloud/
# 3) 切换挂载点
sudo umount /swarm-vol
echo "将 bcache0 挂载为新的 /swarm-vol..."
sudo mount /dev/bcache0 /swarm-vol
echo "检查挂载..."
df -Th /swarm-vol
echo "请人工确认 /swarm-vol 中的数据完整性;若无误,可以继续。"
在执行完成后,/swarm-vol 已经切换到基于 /dev/bcache0(后端是 /dev/sda)的存储,业务就可以使用这套新存储。此时 nvme2n1 上的原有 ext4 数据已不再对外提供服务,但仍在物理上保留(尚未被清空)。
第三阶段 - 将原 nvme2n1 作为 bcache 缓存设备
在这一阶段,我将:
- 确认 /swarm-vol 已经切换成功、业务运行正常且数据安全无误。
- 清空并格式化原本的 nvme2n1,作为 bcache 缓存盘。
- 将缓存盘附加到已经存在的 bcache 后端(即 /dev/sda)上,使两者变为真正的“大容量 + SSD 缓存”组合。
- 根据需求,启用写回缓存(writeback)等激进模式。
示例脚本:
# 1) 确认当前 /swarm-vol 已经是 /dev/bcache0,且业务正常
# (需人工自行验证,确认数据已在 /dev/sda + /dev/bcache0 上)
# 此时可以停一下业务,或保持低负载也行,避免写入影响。
# 2) 清空 nvme2n1 (原来的 /swarm-vol) 注意,这将销毁原数据!
echo "准备清空 /dev/nvme2n1..."
sudo umount /dev/nvme2n1 || true # 若已经卸载则会报错,忽略即可
sudo wipefs -a /dev/nvme2n1
# 3) 将 nvme2n1 作为缓存盘初始化
echo "对 /dev/nvme2n1 执行 make-bcache -C(cache)..."
# 在这个例子里,默认的 block 大小是 512B、bucket 大小是 128kB。
# block 的大小应该与后端设备的 sector 大小匹配(通常是 512 或者 4k);
# bucket 的大小应该与缓存设备的擦除块大小匹配(以减少写入放大)。
# 例如,如果是一个 4k sector 的 HDD 和一个擦除块大小是 2MB 的 SSD 搭配,命令就应该是这样的:
# sudo make-bcache --block 4k --bucket 2M -C /dev/nvme2n1
# 如果你需要查看 /dev/sda (也就是后端)的 block size,可以使用 fdisk -l /dev/sda 等命令。
# 如果你需要查看 /dev/nvme2n1 的擦除块大小,可以使用 nvme id-ns /dev/nvme2n1 等命令。一般是 4M
sudo make-bcache --block 512 --bucket 4M -C /dev/nvme2n1
echo "检查生成的缓存盘信息..."
sudo bcache-super-show /dev/nvme2n1 | grep -E "cset.uuid|dev.uuid"
# 假设输出中 cset.uuid (或 dev.uuid) 为 11111111-2222-3333-4444-555555555555
# (这里仅演示,我需要看实际输出)
CACHE_UUID="(此处填上实际的 cset.uuid)"
# 4) 将缓存设备附加到现有的 /dev/bcache0(后端 /dev/sda)
# /dev/bcache0 的 sysfs 路径可通过 ls /sys/block/bcache0/bcache 等命令确认
echo "附加缓存到现有 bcache 后端..."
echo "$CACHE_UUID" | sudo tee /sys/block/bcache0/bcache/attach
# 如果我看到 echo: write error: Invalid argument,通常是 block size 不匹配等问题
# 如果成功,则 /sys/block/bcache0/bcache/cache_mode 等节点应该出现
# 5) 为 bcache0 启用写回缓存模式(可选)
echo "启用写回 (writeback) 缓存模式..."
echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
# 可选:关闭顺序IO绕过等更激进的做法
# echo 0 | sudo tee /sys/block/bcache0/bcache/sequential_cutoff
# echo 0 | sudo tee /sys/block/bcache0/bcache/writeback_percent
# 6) 确认缓存已生效
echo "确认 /dev/bcache0 依旧正常挂载在 /swarm-vol,并检查 sysfs 等信息:"
mount | grep /swarm-vol
ls -l /sys/block/bcache0/bcache
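顺带一提:cache_mode 会写进 bcache 的超级块,重启后依然生效;但 sequential_cutoff、writeback_percent 这类 sysfs 参数重启后会恢复默认值。如果希望持久化,一种做法是加一个开机执行的小单元(下面只是示意,bcache-tune.service 这个名字是我自拟的):
# /etc/systemd/system/bcache-tune.service(示意)
# [Unit]
# Description=Apply bcache sysfs tuning
# After=local-fs.target
#
# [Service]
# Type=oneshot
# ExecStart=/bin/sh -c 'echo 0 > /sys/block/bcache0/bcache/sequential_cutoff'
#
# [Install]
# WantedBy=multi-user.target
#
# 然后启用它:
# sudo systemctl daemon-reload
# sudo systemctl enable bcache-tune.service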
至此,我已经完成了将旧的 nvme2n1 转变为 bcache 缓存设备的操作,并和 /dev/sda 组合为统一的逻辑设备 /dev/bcache0。接下来的要点包括:
- 开机自动挂载
  - 通常推荐在 /etc/fstab 中写入对 /dev/bcache0 的挂载。
  - 同时需要注意在 initramfs 阶段加载 bcache 模块,或者确保 bcache-tools 的 udev 规则可以自动将 cache attach 到 backing device(以免重启后没了 /dev/bcache0)。在 Ubuntu 下,一般可通过 sudo update-initramfs -u 并检查 /lib/udev/rules.d/69-bcache.rules 等来确认,可参考下面的检查命令。
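下面是一组用来确认模块与 udev 规则就绪的命令(仅作示意;其中写入 /etc/modules 是我补充的常见做法,并非上文脚本的一部分):
# 确认 bcache 模块已经加载
lsmod | grep bcache
# 确认 udev 规则存在(bcache-tools 自带)
ls -l /lib/udev/rules.d/69-bcache.rules
# 让 bcache 模块开机自动加载,并重建 initramfs
echo bcache | sudo tee -a /etc/modules
sudo update-initramfs -u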
在 /etc/fstab 中添加:
# 删除旧的 /swarm-vol 挂载
# /dev/disk/by-uuid/49fd5e45-6074-4370-a95f-c4404920aff5 /swarm-vol ext4 defaults,noatime,nofail 0 0
# 然后添加新的 /swarm-vol 挂载
/dev/bcache0 /swarm-vol ext4 defaults,noatime,nofail 0 0
- 确认写回模式的风险
  - 写回模式(writeback)可以大幅提高速度,但在缓存盘掉电或故障时会丢失尚未写入后端的脏数据。既然我提到 SSD 质量较好,且并不特别在意短期丢失风险,可以大胆使用。
- 调优与监控
  - 适当调节 writeback_percent、sequential_cutoff 等 sysfs 参数可以获得性能与风险的平衡。
  - 还可以用 dstat -D nvme2n1,sda 或者 iostat -xm 1 来观察实际读写流量和缓存命中情况(也可以直接读 sysfs 统计,见下面的示例)。
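直接读 sysfs 统计的小例子(这些节点在后文的实际输出里也能看到):
# 直接读取 bcache 的状态与统计
cat /sys/block/bcache0/bcache/state                            # clean / dirty 等
cat /sys/block/bcache0/bcache/dirty_data                       # 尚未刷回后端的脏数据量
cat /sys/block/bcache0/bcache/cache/stats_day/cache_hit_ratio  # 最近一天的缓存命中率(百分比)
# 持续观察脏数据回刷进度
watch -n 5 cat /sys/block/bcache0/bcache/dirty_data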
完成后,我就拥有一个**后端极大(/dev/sda)+ 前端极快(/dev/nvme2n1 作为缓存)**的综合存储系统,挂载于 /swarm-vol。这样就达到了我预想的“又大又快”的目的。
使用下面的命令检查其状态:
anduin@ms-server:/sys/block/bcache0/bcache$ ls
attach dirty_data sequential_cutoff stripe_size writeback_rate_fp_term_low
backing_dev_name io_disable state writeback_consider_fragment writeback_rate_fp_term_mid
backing_dev_uuid io_error_limit stats_day writeback_delay writeback_rate_i_term_inverse
cache io_errors stats_five_minute writeback_metadata writeback_rate_minimum
cache_mode label stats_hour writeback_percent writeback_rate_p_term_inverse
clear_stats partial_stripes_expensive stats_total writeback_rate writeback_rate_update_seconds
detach readahead_cache_policy stop writeback_rate_debug writeback_running
dev running stop_when_cache_set_failed writeback_rate_fp_term_high
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./running
1
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./state
dirty
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./dirty_data
775.9M
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./writeback_running
1
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./backing_dev_name
sda
anduin@ms-server:/sys/block/bcache0/bcache$ cat ./cache_mode
writethrough [writeback] writearound none
anduin@ms-server:/sys/block/bcache0/bcache$ cd ./cache
anduin@ms-server:/sys/block/bcache0/bcache/cache$ ls
average_key_size bucket_size congested flash_vol_create journal_delay_ms stats_hour tree_depth
bdev0 cache0 congested_read_threshold_us internal root_usage_percent stats_total unregister
block_size cache_available_percent congested_write_threshold_us io_error_halflife stats_day stop
btree_cache_size clear_stats errors io_error_limit stats_five_minute synchronous
anduin@ms-server:/sys/block/bcache0/bcache/cache$ cat ./errors
[unregister] panic
anduin@ms-server:/sys/block/bcache0/bcache/cache$ cat ./bucket_size
512.0k
anduin@ms-server:/sys/block/bcache0/bcache/cache$ cat ./block_size
0.5k
anduin@ms-server:/sys/block/bcache0/bcache/cache$ cd ./stats_day/
anduin@ms-server:/sys/block/bcache0/bcache/cache/stats_day$ ls
bypassed cache_bypass_hits cache_bypass_misses cache_hit_ratio cache_hits cache_miss_collisions cache_misses
anduin@ms-server:/sys/block/bcache0/bcache/cache/stats_day$ cat ./cache_hit_ratio
4
anduin@ms-server:/sys/block/bcache0/bcache/cache/stats_day$ cat ./cache_hits
11611
anduin@ms-server:/sys/block/bcache0/bcache/cache/stats_day$ cat ./cache_misses
269927
anduin@ms-server:/sys/block/bcache0/bcache/cache/stats_day$ cd /swarm-vol/
anduin@ms-server:/swarm-vol$ df . -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/bcache0 ext4 58T 6.7T 49T 13% /swarm-vol
If unable to run wipefs on a device due to "Device or resource busy" error
The error "Device or resource busy" indicates that the device /dev/nvme1n1 is currently in use, preventing wipefs from accessing it. To resolve this, you need to ensure that no processes or mount points are actively using the device. Here are some steps to identify and resolve the issue:
1. Check if the device is mounted
Run:
mount | grep /dev/nvme1n1
If it is mounted, unmount it:
sudo umount /dev/nvme1n1
2. Check for active partitions
If any partitions on /dev/nvme1n1 are in use, they need to be unmounted:
lsblk
Unmount active partitions:
sudo umount /dev/nvme1n1pX # Replace "X" with the partition number
4. Check for bcache association
The presence of bcache0 suggests bcache is in use. Verify:
sudo bcache-super-show /dev/nvme1n1
If it is associated, unregister it:
echo 1 | sudo tee /sys/block/bcacheX/bcache/stop # Replace "bcacheX" appropriately
Clear the bcache superblock:
sudo wipefs -a /dev/nvme1n1
5. Check for swap usage
If the device is used as swap:
cat /proc/swaps
sudo swapoff /dev/nvme1n1
6. Check for open file descriptors
List processes using the device:
sudo lsof | grep /dev/nvme1n1
Kill the processes if necessary:
sudo kill -9 <PID>
7. Retry wipefs
Once the device is no longer in use, retry:
sudo wipefs -a /dev/nvme1n1
If the bcache device is not showing up in fdisk
First, try some clean-up:
echo $cset_uuid | sudo tee /sys/fs/bcache/pendings_cleanup
echo $backing_uuid | sudo tee /sys/fs/bcache/pendings_cleanup
Use bcache-super-show to get the uuids.
Then try again to register:
echo $cset_uuid | sudo tee /sys/fs/bcache/register
echo $backing_uuid | sudo tee /sys/fs/bcache/register
The cache UUID should exist in /sys/fs/bcache if the cache device is successfully registered.
If bcache-super-show says that the backing device's dev.data.cache_state is clean and the cset.uuid consists only of zeros, the bcache device is in an invalid state and must be recreated. [source]
However, if clean, you could try force-starting the backing device without a cache device:
echo 1 | sudo tee /sys/class/block/$dev/bcache/running
Eject cache
I used bcache only in a writethrough configuration, and IIRC even then bcache doesn't like it at all if the cache device vanishes while the machine is running. Expect the bcache device to stall completely if that happens.
I haven't tried to remove the cache device while the machine is powered down, so I can't say anything about that. I do think though that bcache is still pretty touchy, so I'd recommend that you try that with a VM or a physical test machine first.
To safely remove the cache device, you can detach the cache set from the bcache device:
echo <cache-set-uuid> > /sys/block/bcache0/bcache/detach
To determine the necessary cache set UUID, look in /sys/fs/bcache/:
host ~ # ll /sys/fs/bcache/
total 0
drwxr-xr-x 7 root root 0 Feb 19 00:11 eb99feda-fac7-43dc-b89d-18765e9febb6
--w------- 1 root root 4096 Feb 19 00:11 register
--w------- 1 root root 4096 Feb 7 07:17 register_quiet
So for example in this case, run:
echo eb99feda-fac7-43dc-b89d-18765e9febb6 > /sys/block/bcache0/bcache/detach
The state file should say no cache after that:
host ~ # cat /sys/block/bcache0/bcache/state
no cache
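One extra note (general bcache behaviour, not part of the quoted answer above): if the cache is in writeback mode, it is safer to switch to writethrough first and wait until all dirty data has been flushed before detaching:
# Switch to writethrough so no new dirty data is produced
echo writethrough | sudo tee /sys/block/bcache0/bcache/cache_mode
# Wait until the state reports "clean" (or dirty_data drops to 0)
cat /sys/block/bcache0/bcache/state
cat /sys/block/bcache0/bcache/dirty_data
# Only then detach the cache set
echo <cache-set-uuid> | sudo tee /sys/block/bcache0/bcache/detach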