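Thursday, March 29, 2007

Solaris Disk Partition Structure

Under Solaris, a disk contains 8 partitions, labeled 0-7. You can see this by running the format command and selecting a disk. For example, on my own system (Solaris 9, Ultra 60) the output looks like this: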
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c0t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
/pci@1f,4000/scsi@3/sd@0,0
Specify disk (enter its number): 0
selecting c0t0d0
[disk formatted]
Warning: Current Disk has mounted partitions.

FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> p


PARTITION MENU:
0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit
partition>
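Don't be put off by the amount of output. The entry printed by format, 0. c0t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248> /pci@1f,4000/scsi@3/sd@0,0, is easy to read. "0. c0t0d0" is the only entry in the list, which means this Ultra 60 has a single disk installed (the meaning of c0t0d0 is explained later). <SUN18G cyl 7506 alt 2 hd 19 sec 248> gives the disk's size and geometry (cylinders, alternate cylinders, heads, and sectors per track). /pci@1f,4000/scsi@3/sd@0,0 is the disk's physical device path. This looks complicated, but in practice you usually only need to check that the number of disks format finds matches the number installed in the system. If you installed two disks and only one is listed here, something needs attention.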
FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> p
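These are the available commands. At the format> prompt at the bottom I used the p command. (Wait, there is no p command in the list? Here p is simply short for partition.) It then printed the following: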
PARTITION MENU:
0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit
partition>
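OK, that achieves our goal: this output shows clearly that under Solaris a disk contains 8 partitions, labeled 0-7. The remaining entries are further commands; their meaning is given by the English descriptions next to them, and how to use them is a topic for another time. Now for today's main subject.
Before data can be written to a disk, the disk must be partitioned and formatted. This is generally done in three steps:
1. Physical formatting, usually called low-level formatting (LLF);
2. Partitioning;
3. Logical formatting, usually called high-level formatting (HLF).
During low-level formatting the disk is divided into tracks, the tracks are divided into sectors, and each sector is filled with random data. Almost all disks are low-level formatted at the factory, so users only need to perform the remaining two steps (partitioning and logical formatting).
Partitioning divides the disk into several parts, called partitions or slices. Each partition consists of a number of cylinders. In most cases a Solaris partition corresponds one-to-one to a file system: a partition cannot hold more than one file system, and a file system cannot span partitions. In Solaris, disks are partitioned with the format command used at the start of this post.
When Solaris performs high-level formatting, it divides each partition into cylinder groups, each made up of several consecutive cylinders. The file system creates files and directories within these cylinder groups and tries to keep the data of a single file in the same cylinder group. This minimizes head movement and so speeds up reads. Solaris uses the newfs command for high-level formatting. The default file system is the UNIX File System (UFS), which uses the following types of blocks:
1. Boot block: stores the information needed at system startup
2. Superblock: stores information about the file system
3. Inode: stores information about a single file in the file system
4. Storage block / data block: stores file data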

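These block types are described in detail below.
Boot block:
The boot block stores the information needed at system startup. It always resides in the first cylinder group and occupies the first 8 KB of the partition.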

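Superblock:
The superblock stores information about the file system, including:
1. The total number of blocks in the file system (the file system size)
2. The number of data blocks in the file system
3. The number of inodes
4. The number of cylinder groups
5. The block size
6. The fragment size
7. The number of free blocks
8. The number of free inodes
The superblock is critical to the file system, so Solaris keeps multiple backup copies of it. Occasionally, after an unclean shutdown or a disk fault, the default superblock cannot be read correctly or no longer agrees with its backups, and it has to be repaired. Normally the system does this automatically by running fsck at the next boot. When fsck finds that the default superblock is damaged and cannot be repaired automatically, it asks the user to repair it manually.
A manual repair can follow these steps:
1. Enter single-user mode. For example, at the PROM (the ok prompt) the command boot -s boots into single-user mode; from a running system, sync;sync;sync;init 0 brings the machine down to the ok prompt, from which boot -s can then be run (single-user mode will be covered in detail later).
2. If the damaged file system is mounted, change to a directory outside it and unmount it, for example:
# cd /
# umount /var
3. Use newfs -N to display the superblock values. This command lists the locations of the backup superblocks in the file system: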
# newfs -N /dev/dsk/c0t0d0s1
/dev/rdsk/c0t0d0s1: 961248 sectors in 204 cylinders of 19 tracks, 248 sectors
469.4MB in 13 cyl groups (16 c/g, 36.81MB/g, 17664 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 75680, 151328, 226976, 302624, 378272, 453920, 529568, 605216, 680864,
756512, 832160, 907808,
4. Pick one of the backup superblocks listed by newfs -N and pass it to fsck as an option to perform the repair:
# fsck -F ufs -o b=453920 /dev/rdsk/c0t0d0s1

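Inodes
An inode holds all information about a file except its name. An inode occupies 128 bytes of disk space and contains the following:
1. File type: regular file, directory, block device, character device, link, and so on
2. File permissions: a combination of read, write, and execute permissions
3. The number of hard links to the file
4. The user ID of the file's owner
5. The group ID of the file's group
6. The file size (in bytes)
7. An array of 15 disk block addresses
8. The date and time the file was last accessed
9. The date and time the file was last modified
10. The date and time the inode was last changed (ctime; often described as the creation time)

Every file on disk has an inode describing it. When a file system is created, a fixed number of inodes is created in the cylinder groups at the same time. Sometimes these are not enough, for example when a program produces a large number of small files; such a file system needs more inodes. Conversely, if we know in advance that the file system will hold only a few large files, we can save disk space by reducing the number of inodes, since each inode takes 128 bytes. When creating a file system, the -i option of newfs increases or decreases the number of inodes (-i sets the number of bytes of data space per inode, so a smaller value yields more inodes). The df command in the /usr/ucb directory shows the inode usage of a file system. For example:
# /usr/ucb/df -i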
Filesystem iused ifree %iused Mounted on
/dev/dsk/c0t0d0s0 131672 1929384 6% /
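Note: once a file system has been created, its number of inodes cannot be changed. When the inodes run out, first back up the file system's data, then create a new file system with more inodes, and finally restore the backup onto the new file system.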

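Storage blocks / data blocks
Storage blocks, also called data blocks, take up all the remaining space in the file system. They hold the data of the files stored on disk. The size of a storage block is fixed when the file system is created. For a regular file, the storage blocks hold the file's contents; for a directory, they hold the inode number and file name of every file in that directory.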

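Disk naming
This section explains the name c0t0d0 that we saw when first running format. Solaris refers to disks by device names. A disk device name is a string of letters and digits in the form cXtXdX, such as c0t0d0. The letters (c, t, d) are always the same, while each X is a number identifying a particular controller, target, or disk. For example, c0t0d0 means controller 0, target 0, LUN 0. This is usually the first disk in the system and often the boot disk.
Sun defines logical device names as follows:
/dev/[r]dsk/cXtXdXsX
c: logical controller number
t: physical bus target number
d: disk or logical unit number (LUN)
s: partition (slice) number
cX: X identifies the disk controller. When a Sun system collects information about the installed disk controllers, it assigns each a number according to the order in which they are detected. The first controller detected is 0, the second is 1, and so on. On IDE systems the first IDE channel is 0 and the second (if present) is 1.
tX: X is the disk's target number, sometimes called the SCSI ID. Every disk on a controller has a unique target number, which lets the controller address each disk independently. For IDE disks the master has target number 0 and the slave has target number 1.
dX: X is the disk's logical unit number (LUN). Some disk arrays use the LUN to distinguish individual disks: the array presents a group of disks under one target number and uses the LUN to identify each disk within the group. This scheme is widely used in SCSI disk arrays and CD changers. For a single disk or an IDE disk this number is always 0.
sX: X is the partition number on the disk and corresponds to the disk's partitions. As stated earlier, "under Solaris a disk contains 8 partitions, labeled 0-7", so X here can only be 0 through 7.
With this, the meaning of c0t0d0s0 is fully explained: controller 0, target 0, disk (LUN) 0, slice 0.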
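
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper name parse_device_name and the regular expression are assumptions made for this example and are not part of Solaris.

```python
import re

# Hypothetical helper (not a Solaris tool): split a logical device name
# such as /dev/dsk/c0t0d0s0 into controller, target, LUN and slice.
DEVICE_RE = re.compile(r"c(\d+)t(\d+)d(\d+)(?:s([0-7]))?$")

def parse_device_name(name):
    match = DEVICE_RE.search(name)
    if match is None:
        raise ValueError(f"not a cXtXdX[sX] device name: {name!r}")
    controller, target, lun, slice_no = match.groups()
    return {
        "controller": int(controller),
        "target": int(target),
        "lun": int(lun),
        # The slice is optional: format lists whole disks as c0t0d0.
        "slice": None if slice_no is None else int(slice_no),
    }

print(parse_device_name("/dev/rdsk/c0t0d0s1"))
```

The slice group only accepts 0 through 7, matching the eight partitions per disk described above.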

 

张志强
2007-03-29

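Principles of Disk Arrays

1. Why are disk arrays needed?
How to increase disk access speed, how to prevent data loss caused by disk failure, and how to use disk space efficiently have long troubled computer professionals and users. Large-capacity disks are also very expensive, which is a heavy burden for users. Disk array technology addresses all of these problems at once.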

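Over the past decade or so, CPU speed has increased more than fiftyfold and memory access speed has also risen sharply, while the access speed of storage devices, chiefly hard disks, has only increased three- or fourfold. Disks have become the bottleneck of the computer system and drag down its overall throughput. Unless disk access speed is raised effectively, the imbalance between CPU, memory, and disk wastes the improvements made to CPU and memory.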

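There are currently two main ways to improve disk access speed. The first is a disk cache controller, which keeps data read from the disk in cache memory to reduce the number of disk accesses. Reads and writes are served from cache memory, which greatly increases access speed; the disk itself is accessed only when the requested data is not in the cache or when data has to be written to disk. In a single-tasking environment such as DOS, this works well for bulk data access (but not for small, frequent accesses). In a multi-tasking environment (because of constant data swapping) or for database access (because each record is small) it shows little benefit. This approach also offers no data protection.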

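The second is disk array technology. A disk array combines multiple disks into one array that is used as a single disk. Data is stored across the disks in segments (striping), so that on access the relevant disks in the array work together. This greatly reduces access time and also makes better use of disk space. The different techniques a disk array can use are called RAID levels; different levels target different systems and applications in order to address data safety.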

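High-performance disk arrays are generally implemented in hardware, combining disk cache control and the disk array in a single controller (RAID controller) or controller card. For different users they address the four main demands placed on the disk I/O system:
(1) higher access speed;
(2) fault tolerance, that is, safety;
(3) efficient use of disk space;
(4) balancing the performance gap between CPU, memory, and disk as far as possible, to raise the overall performance of the computer.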

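2. How disk arrays work

The different techniques that disk arrays use for different applications are called RAID levels. RAID stands for Redundant Array of Inexpensive Disks, and each level denotes one technique. The levels generally recognized by the industry are RAID 0 through RAID 5. The level number does not rank the techniques: level 5 is not superior to level 3, and level 1 is not inferior to level 4. Which RAID level to choose depends entirely on the user's operating environment and application, not on the level number.
RAID 0 and RAID 1 suit PCs and PC-related systems such as small network servers and workstations that need high disk capacity and fast disk access; they are relatively cheap. RAID 3 and RAID 4 suit mainframes and image or CAD/CAM processing. RAID 5 is mostly used for OLTP (online transaction processing); because financial institutions and large data centers need it urgently, it is widely used and well known. RAID 2 is rarely used. Others such as RAID 6, RAID 7, and RAID 10 are vendor-specific with no common standard and are not covered here. Before introducing the individual RAID levels, let us look at the two basic techniques that make up a disk array: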

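Disk Spanning:

The name describes the technique well: small disks are extended into one larger disk. As shown in the figure, the disk array controller is connected to four disks. These four disks form an array, and the RAID controller treats them as a single disk, such as the C: drive under DOS. This is what disk spanning means. Because small disks are extended into one large disk, users do not have to plan how data is distributed over the disks, and disk space is used more efficiently. Disk capacity can also be extended almost without limit, and because the disks perform accesses together, access is faster than with a single disk. The various RAID techniques all build on arrays formed in this way.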


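Disk Striping or Data Striping:

Because a disk array treats the disks of one array as a single virtual disk, data is stored in the array sequentially in segments (blocks). The data is split into segments as needed and placed starting on the first disk; after the last disk it wraps around to the first disk again, until all data has been distributed. The segment size depends on the system: some systems are most efficient with 1 KB, others with 4 KB or 6 KB, or even 4 MB or 8 MB. Unless the data is smaller than one sector (512 bytes), the segment size should be a multiple of 512 bytes. Disk reads and writes work in units of one sector, so if the data is smaller than 512 bytes the system has to combine or split it after reading the sector (depending on whether it is a read or a write), which wastes time. As the figure above shows, the data is spread in segments over different disks, and all disks of the array can read and write at the same time, so data striping gives the most efficient access. In theory, reading data that consists of four segments would take about (disk access time + data transfer time) x 4; with striping it takes only one such period.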

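With N the number of disks, R for reads, W for writes, and S for usable space, the performance of data striping is:
R: N (all disks can be read at the same time)
W: N (all disks can be written at the same time)
S: N (all disks are used, giving the best utilization)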
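
The round-robin placement described above can be written down as a small Python sketch. The function name locate_segment is chosen for this example only, and the layout (segment i goes to disk i mod N) is the simple scheme the text describes, not the behaviour of any particular controller.

```python
def locate_segment(segment_index, num_disks):
    """Return (disk, row) for a segment laid out round-robin across disks.

    Segment 0 goes to disk 0, segment 1 to disk 1, and after the last
    disk the placement wraps around to disk 0 on the next row.
    """
    if num_disks < 1 or segment_index < 0:
        raise ValueError("need at least one disk and a non-negative index")
    return segment_index % num_disks, segment_index // num_disks

# Four disks: the first four segments share row 0, one per disk,
# so all four can be read in a single parallel access.
for i in range(6):
    print(i, locate_segment(i, 4))
```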

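Disk striping is also called RAID 0. Many people think RAID 0 is nothing special, but that is quite wrong, because RAID 0 gives disk I/O its highest efficiency. Besides data striping, the disk array is more efficient because it can serve several I/O requests at once: every disk in the array can operate independently, segments sit on different disks, and different disks can read and write at the same time. It can also access cache memory and disks in parallel, but only hardware disk arrays deliver this performance.

From these two points we can see that disk spanning defines the basic form of RAID and provides a cheap, flexible, high-performance system structure, while disk striping solves the problems of access efficiency and disk utilization. RAID 1 through RAID 5 add disk safety on top of this foundation.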

RAID 1

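RAID 1 uses disk mirroring. Disk mirroring was used in many systems before RAID 1 existed: a backup disk is added to the working disk, both disks hold exactly the same data, and every write to the working disk is also written to the backup disk. Disk mirroring is not necessarily RAID 1; Novell Netware, for example, offers disk mirroring, but that does not mean Netware provides RAID 1. Ordinary disk mirroring and RAID 1 differ in two major ways:

RAID 1 makes no distinction between working disk and backup disk. Several disks can operate at the same time and perform overlapped reads, and different mirrored disks can even be written at the same time. This optimization is called load balancing. For example, when several users want to read data at the same time, the system can drive the mirrored disks simultaneously and read from all of them, which lightens the load and improves I/O performance.

The disks of a RAID 1 array are formed into an array by disk spanning, and the data is stored by data striping, so its read performance is almost the same as that of RAID 0. The RAID structure makes the difference between RAID 1 and ordinary disk mirroring quite clear.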

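The figure below shows RAID 1, where every piece of data is stored twice:
From the figure we can see:
R: N (all disks can be read at the same time)
W: N/2 (number of disks that can be written at the same time)
S: N/2 (utilization)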

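Reads can use all disks and take full advantage of data striping. Writes have to go to two disks because of the mirror copy, so write efficiency is N/2, and only half of the total disk space is usable.

Many people dismiss RAID 1 as wasteful because it needs an extra disk. In fact disks keep getting cheaper, so this is not necessarily a burden. RAID 1 also has the best fault tolerance, and its efficiency is the best of all levels except RAID 0.

In disk array technology, from RAID 1 through RAID 5, non-stop operation means that if a disk fails during operation, the system keeps working without interruption and can still access the disks and read and write data normally. Fault tolerance means that the data remains intact even when a disk fails, so the system still gets correct data. SCSI disk arrays can even have disks swapped during operation and rebuild the failed disk's data automatically. A disk array can provide fault tolerance and non-stop operation because it has redundant disk space to draw on, which is what Redundant in the name refers to.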

RAID 2

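RAID 2 splits data into bits or blocks, adds a Hamming code, and interleaves the result across every disk in the array at identical addresses, that is, at the same cylinder or track and sector on each disk. RAID 2 is designed around spindle synchronization: on each access the whole array moves together and accesses the same position on every disk in parallel, which gives the best access time. Its bus is specially designed to transfer the accessed data in parallel with high bandwidth, which gives the best transfer time. RAID 2 performs best on large files. Small files pull its performance down, because disks are accessed in units of sectors while RAID 2 moves all disks in parallel and accesses data bit by bit, so amounts of data smaller than one sector hurt performance badly. RAID 2 is designed for computers that need large amounts of continuous data, such as mainframes and supercomputers or workstations doing image processing or CAD/CAM. It is not suitable for ordinary multi-user environments, network servers, minicomputers, or PCs.

For safety, RAID 2 uses the memory array technique: several extra disks provide single-bit error correction and double-bit error detection. How many extra disks are needed depends on the method and structure used. For example, an array of eight data disks may need three extra disks, while a high-end array with thirty-two data disks may need seven.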


RAID 3

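RAID 3 stores and accesses data in the same way as RAID 2, but for safety it replaces the Hamming code with a parity check for error detection and correction, so it needs only one extra parity disk. The parity value is computed by XORing the corresponding bits of all disks, and the result is written to the parity disk. Every data modification requires a parity computation.

If a disk fails, it is replaced with a new one and the whole array (including the parity disk) is read and recomputed once to regenerate the failed disk's data, which is written to the new disk. If the parity disk fails, the parity values are recomputed. Either way the fault-tolerance requirement is met.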
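
The XOR parity computation and the rebuild of a failed disk can be illustrated with a minimal Python sketch. The function names and the tiny two-byte "disks" are assumptions made for the example; a real array works on whole disks at the controller level.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR the corresponding bytes of equally sized blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # three data disks
parity = xor_parity(data)                         # contents of the parity disk
# Pretend disk 1 failed and rebuild it from the other disks plus parity.
assert rebuild_missing([data[0], data[2]], parity) == data[1]
```

Because XOR is its own inverse, the same routine serves both for computing the parity and for regenerating any single lost disk.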

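Compared with RAID 1 and RAID 2, RAID 3 achieves 85% disk space utilization. Its performance is slightly below RAID 2 because of the parity computation. Spindle-synchronized parallel access performs very well when reading files but is slower on writes, because the contents of the parity disk have to be recomputed and updated. RAID 3 is used in the same way as RAID 2: it suits large files and heavy data I/O and is not suitable for PCs or network servers.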

RAID 4

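RAID 4 also uses one parity disk, but differs from RAID 3.

RAID 4 stripes data by sector. The segments at the same position on each disk form a parity block, which is stored on the parity disk. This allows different read commands to run in parallel on different disks, which greatly improves the array's read performance. Writes, however, are limited by the parity disk: only one write can proceed at a time, and it has to activate all disks to read the data segments that share the parity block, compute the parity together with the data to be written, and then write. Even so, small files are still written faster than on RAID 3, because the parity computation is simpler and is not done at the bit level. The parity disk is nevertheless RAID 4's bottleneck and lowers performance, and since RAID 5 exists, RAID 4 is seldom used.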

RAID 5
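RAID 5 avoids the bottleneck of RAID 4 by dropping the dedicated parity disk and placing the parity data on every disk in rotation.

The segment on the first disk of the array is a parity value, and the segments from the second disk through the last disk and back around to the first disk are data. Then the segment on the second disk is a parity value, and the segments from the third disk around to the second disk are data, and so on until everything has been placed. In the figure, the first parity block is computed from A0, A1, ..., B1, B2, and the second parity block from B3, B4, ..., C4, D0; in other words, a parity value is computed from the data segments at the same position on each disk. This greatly improves access performance for small files. Not only can reads proceed simultaneously, it may even be possible to perform several writes at once, for example writing data to disk 1 whose parity block is on disk 2 while also writing data to disk 4 whose parity block is on disk 1. For online transaction processing (OLTP) such as banking, finance, and stock trading systems, or for large databases, this is the best solution, because each record in these applications is small, disk I/O is frequent, and fault tolerance is required.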
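
One possible reading of the rotating parity placement is sketched below in Python. The function name raid5_row_layout and the exact rotation (parity for row r on disk r mod N, data on the remaining disks starting after the parity disk) are assumptions for illustration; the original figure is not available, and real controllers use several different rotation schemes.

```python
def raid5_row_layout(row, num_disks):
    """Return (parity_disk, data_disks) for one row of segments.

    Assumed rotation: the parity segment of row r sits on disk r mod N,
    and the data segments fill the other disks in order, wrapping around.
    Disks are numbered from 0 here.
    """
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    parity_disk = row % num_disks
    data_disks = [(parity_disk + k) % num_disks for k in range(1, num_disks)]
    return parity_disk, data_disks

# With five disks the parity moves one disk to the right on every row,
# so no single disk carries all parity traffic as in RAID 4.
for row in range(5):
    print(row, raid5_row_layout(row, 5))
```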

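In practice RAID 5 does not perform quite this well, because any modification requires reading all the data belonging to the same parity block, modifying it, computing the parity, and writing it back. This is the RMW cycle (Read-Modify-Write cycle; the cycle itself does not include the parity computation). Because changing one part affects the whole:
R: N (all disks can be read at the same time)
W: 1 (number of disks that can be written at the same time)
S: N-1 (utilization)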

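Controlling RAID 5 is fairly complex, especially when the disk array is controlled in hardware, because this approach has to keep track of more things than the other RAID levels and generates more I/O. It must be fast while also processing data, computing parity, and performing error correction, so it costs more. It is best suited to OLTP; for image processing and the like it does not necessarily give the best performance.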

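2. Additional fault-tolerance features of disk arrays: spare or standby drive

Fault tolerance has in fact become the most valued property of disk arrays. To strengthen it, and to let the system rebuild data quickly after a disk failure so that performance is maintained, most disk array systems offer a hot spare (hot standby drive). When the disk array system is configured, one disk is designated as the spare. This disk is normally idle, but when a disk in the array fails, the spare takes its place and the failed disk's data is rebuilt on it automatically. Because the array reacts quickly and cache memory reduces disk accesses, the rebuild completes quickly and has little effect on system performance. For large data centers or control centers that must not stop, the hot spare is an especially important feature, because it avoids the trouble caused by disk failures at night or when nobody is on duty.

Another fault-tolerance feature is bad sector reassignment. Bad sectors are the main cause of disk failure. Normally, when a bad sector is hit during a read or write, the disk is considered failed and can no longer be used, and many systems even hang because the read or write cannot complete. If one damaged sector stops the work or forces a disk replacement, system performance suffers badly and maintenance costs are too high. With bad sector reassignment, when the disk array system finds a bad sector on a disk, it substitutes a blank, fault-free sector for it. This extends the life of the disk and reduces both the rate of failed disks and the maintenance cost. Bad sector reassignment therefore gives the disk array better fault tolerance and the whole system a better cost-benefit ratio. Other features, such as cache memory with external battery backup to avoid losing data not yet written to disk on a sudden power failure, or write consistency checks in RAID 1, are small techniques but should not be overlooked.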


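3. Hardware disk array or software disk array

The market distinguishes between hardware and software disk arrays. Because a software disk array connects to the disks through a SCSI card, ordinary users often mistake it for a hardware disk array. The description above mainly concerns hardware disk arrays, which differ from software disk arrays in several major ways:

- A complete piece of disk array hardware is attached to the system.
- It has a built-in CPU that runs in parallel with the host. All disk array I/O is handled inside the array, which reduces the load on the host and improves overall system performance.
- It has excellent bus mastering and DMA (Direct Memory Access) capabilities, which speed up data access and transfer.
- It is combined with cache memory, which not only improves data access and transfer performance but also extends disk life by reducing disk accesses.
- It can take full advantage of the hardware and reacts quickly.

A software disk array is a program that runs on the host and forms an array from disks attached through a SCSI card. Its greatest advantage is low cost, since there is no hardware cost (development, production, maintenance, and so on) and SCSI cards are cheap (although some software disk arrays require specific, expensive SCSI cards). Its greatest drawback is that it adds many processes to the host and increases the host's load, especially on systems with heavy I/O demand. Most disk array systems on the market today are hardware disk arrays; software disk arrays are less common.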


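4. Disk array card or disk array controller

Disk array controller cards are generally used in small systems for a single machine. They share the host's power supply, so there is a risk of losing data in the cache when the host is powered off. A controller card only offers the common bus interfaces, and its driver depends on the host and on the host's operating system, so there are hardware and software compatibility issues that potentially make the system less stable. Replacing a disk array card carries the risk of disk damage, data loss, and unplanned downtime.

Standalone disk array controllers are generally used in larger systems and come in two kinds: single-channel and multi-channel. A single-channel disk array can attach to only one host, which severely limits expansion. A multi-channel disk array can be used by several systems at once and shared by them as a cluster, which leaves internal array controllers and single-attached disk arrays with no role to play. Most standalone disk array subsystems today, with respect to the host system's hardware and operating environment ...
--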


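First of all, IDE does not outperform SCSI, especially under multi-tasking. Advertisements usually quote the maximum transfer rate, not the working speed. Compared with SCSI disks of the same period, IDE disks are produced in larger volumes and have simpler electronics, so they cost much less than SCSI, but in performance they are far behind.

RAID does not limit the number of disks; in principle, the more disks the better.
For a SCSI-based RAID, the maximum number of disks depends on the number of SCSI channels (SCSI buses), generally up to 15 disks per channel (SCSI-3). For FC-AL (Fibre Channel) a loop can address about 126 devices per channel. Of course, you need a disk enclosure big enough for that!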

 


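Compiling OpenSSL on Win32

1. The problem

OpenSSL is a well-known cryptography toolkit. We often need parts of it in our own projects, for example to compute MD5 values. This is a challenge for most Windows programmers who are not familiar with Linux, because OpenSSL does not offer binary downloads, so you have to build the lib and dll files yourself before using it. Installing it on Linux is easy; anyone with some Linux programming experience should be able to install OpenSSL there. What follows is just my own build process; for details see the INSTALL.W32 file.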

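2. My environment

Operating system: Windows XP SP2

Compiler: VS2003

3. Download the latest OpenSSL source code from: http://www.openssl.org/source

4. Download ActivePerl from http://www.activestate.com/Products/Download/Download.plex?id=ActivePerl and install it

5. Set the path: add the following directory to the PATH environment variable. All my programs are installed on drive C; adjust this to your actual installation path (this step may not be strictly necessary)

C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin

6. Run "C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Tools\vsvars32.bat". This step is essential
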

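7. Run the following commands in this order:

perl Configure VC-WIN32

ms\do_masm

If you do not need assembler support, run the following instead of ms\do_masm:

ms\do_ms

nmake -f ms\ntdll.mak

8. Done. The build output is in the out32dll directory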

 


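Unified Threat Management (UTM)

Unified Threat Management (UTM) is a term that has appeared frequently in the media in recent years. It means integrating various security functions on one hardware platform, such as firewall, VPN, gateway antivirus, intrusion detection, intrusion prevention, traffic analysis, content filtering, and AAA authentication. It emerged because some small and medium-sized enterprises lack security staff and want a single hardware device at the gateway to take care of all their security problems. An IDC survey report stated that UTM would be the fastest-growing information security product. The vendors that first introduced UTM devices internationally include Fortinet and WatchGuard.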
 


The structure of the Reiser file system

The Reiser file system was created by Hans Reiser. The design objectives were to increase performance over the ext2 file system, offer a space efficient file system, and to improve handling of large directories compared to existing file systems. Reiserfs uses balanced trees to store files and directories and it also offers journaling.

This document describes the on-disk structure of the Reiser file system version 3.6. This document does not describe how the file system tree is balanced, how the journaling is performed, or how files and directories are managed within an implementation of the file system.

Blocks

The reiserfs partition is divided into blocks of a fixed size. The blocks are numbered sequentially starting with block 0. There is a maximum number of 2^32 possible blocks in one partition.

The partition starts with the first 64k unused to leave enough room for partition labels or boot loaders. After that follows the superblock. The superblock contains important information about the partition such as the block size and the block numbers of the root and journal nodes. The superblock block number differs depending on the block size, but always starts at byte 65536 of the partition. The default block size for reiserfs under Linux is 4096 bytes. This makes the superblock block number 16. There is only one instance of the superblock for the entire partition.

Directly following the superblock is a block containing a bitmap of free blocks. The number of blocks mapped in the bitmap depends directly on the block size. If a bitmap can map k blocks, then every k-th block will be a new bitmap block.

Block size           4,096     512   1,024     8,192
#blocks in bitmap   32,768   4,096   8,192    65,536
superblock              16     128      64         8
1st bitmap              17     129      65         9
2nd bitmap          32,768   4,096   8,192    65,536
3rd bitmap          65,536   8,192  16,384   131,072
4th bitmap          98,304  12,288  24,576   196,608
...
(assuming that the partition is large enough to have a 2nd, 3rd, and 4th bitmap)
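
The table can be reproduced with a short Python sketch. The function name bitmap_block_numbers is an assumption for this example; the rule it applies (superblock at byte 65536, first bitmap right after it, further bitmaps at multiples of 8 * block size) is taken from the text above.

```python
SUPERBLOCK_BYTE_OFFSET = 65536  # the first 64k of the partition are unused

def bitmap_block_numbers(block_size, total_blocks):
    """Return (superblock block number, list of bitmap block numbers)."""
    blocks_per_bitmap = 8 * block_size          # one bit per block
    superblock = SUPERBLOCK_BYTE_OFFSET // block_size
    numbers = [superblock + 1]                  # 1st bitmap follows the superblock
    # Every further bitmap sits at a multiple of blocks_per_bitmap.
    numbers.extend(range(blocks_per_bitmap, total_blocks, blocks_per_bitmap))
    return superblock, numbers

# A partition of 65638 blocks of 4096 bytes needs three bitmap blocks.
print(bitmap_block_numbers(4096, 65638))
```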

Following the first bitmap block should be the journal, but the information in the superblock is the authoritative source for that information.

The Superblock

The superblock layout
Name Size Description
Block count 4 The number of blocks in the partition
Free blocks 4 The number of free blocks in the partition
Root block 4 The block number of the block containing the root node
Journal block 4 The block number of the block containing the first journal node
Journal device 4 Journal device number (not sure what for)
Orig. journal size 4 Original journal size. Needed when using partition on systems with different default journal sizes.
Journal trans. max 4 The maximum number of blocks in a transaction
Journal magic 4 A random magic number
Journal max batch 4 The maximum number of blocks in a transaction
Journal max commit age 4 Time in seconds of how old an asynchronous commit can be
Journal max trans. age 4 Time in seconds of how old a transaction can be
Blocksize 2 The size in bytes of a block
OID max size 2 The maximum size of the object id array
OID current size 2 The current size of the object id array
State 2 State of the partition: valid (1) or error (2)
Magic string 12 The reiserfs magic string, should be "ReIsEr2Fs"
Hash function code 4 The hash function that is being used to sort names in a directory
Tree Height 2 The current height of the disk tree
Bitmap number 2 The amount of bitmap blocks needed to address each block of the file system
Version 2 The reiserfs version number
Reserved 2  
Inode Generation 4 Number of the current inode generation.

The inode generation number is a counter that denotes the current generation of inodes. The counter is increased every time the tree gets re-balanced.

Example:

The following is the start of the superblock of a 256MB reiserfs partition on an Intel based system:

00000000 66 00 01 00 93 18 00 00 82 40 00 00 12 00 00 00  f........@......
00000010 00 00 00 00 00 20 00 00 00 04 00 00 ac 34 11 57  ..... ......¬4.W
00000020 84 03 00 00 1e 00 00 00 00 00 00 00 00 10 cc 03  ..............Ì.
00000030 08 00 02 00 52 65 49 73 45 72 32 46 73 00 00 00  ....ReIsEr2Fs...
00000040 03 00 00 00 04 00 03 00 02 00 00 00 dc 52 00 00  ............ÜR..
An example superblock
Block count: 65638
Free blocks: 6291
Root block: 16514
Journal block: 18
Journal device: 0
Original journal size: 8192
Journal trans. max: 1024
Journal magic: 1460745388
Journal max. batch: 900
Journal max. commit age: 30
Journal max. trans. age: 0
Blocksize: 4096
OID max. size: 972
OID current size: 8
State: 2 (error)
Magic String: ReIsEr2Fs
Hash function code: 3
Tree height: 4
Bitmap number: 3
Version: 2
Inode generation: 21212
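
As a cross-check, the example bytes can be decoded with Python's struct module. This is a sketch based only on the field sizes in the layout table above, read as little-endian (the example comes from an Intel system); the field names and the function parse_superblock are made up for this illustration.

```python
import struct

# 11 four-byte fields, 4 two-byte fields, the 12-byte magic string,
# the hash code, 4 more two-byte fields, and the inode generation.
SUPERBLOCK_FORMAT = "<11I4H12sI4HI"
FIELD_NAMES = (
    "block_count", "free_blocks", "root_block", "journal_block",
    "journal_device", "orig_journal_size", "journal_trans_max",
    "journal_magic", "journal_max_batch", "journal_max_commit_age",
    "journal_max_trans_age", "block_size", "oid_max_size",
    "oid_current_size", "state", "magic_string", "hash_function_code",
    "tree_height", "bitmap_number", "version", "reserved",
    "inode_generation",
)

def parse_superblock(raw):
    fields = dict(zip(FIELD_NAMES, struct.unpack_from(SUPERBLOCK_FORMAT, raw)))
    fields["magic_string"] = fields["magic_string"].rstrip(b"\x00").decode("ascii")
    return fields

EXAMPLE = bytes.fromhex("""
    66 00 01 00 93 18 00 00 82 40 00 00 12 00 00 00
    00 00 00 00 00 20 00 00 00 04 00 00 ac 34 11 57
    84 03 00 00 1e 00 00 00 00 00 00 00 00 10 cc 03
    08 00 02 00 52 65 49 73 45 72 32 46 73 00 00 00
    03 00 00 00 04 00 03 00 02 00 00 00 dc 52 00 00
""")

for name, value in parse_superblock(EXAMPLE).items():
    print(f"{name}: {value}")
```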

Bitmap blocks

The bitmap blocks are simple bitmaps, where every bit stands for a block number. One bitmap block can address (8 * block size) blocks. Byte 0 of the bitmap maps to the first eight blocks, the second byte to the next eight, and so on. Within a byte, the low-order bits map to the lower-numbered blocks: bit 0 maps to the first block, bit 1 to the second, etc. A set bit indicates that the block is in use, a zero bit that the block is free.

Example:

00000400 ff ff f7 ff 7f 00 00 00 00 00 00 00 00 80 cb bd  ÿÿ÷ÿ..........˽  
These 16 bytes of bitmap block 0 describe block numbers 8192 to 8319.

Blocks 8192-8210: used
Block 8211: free (f7 is 11110111 binary)
Blocks 8212-8230: used
Blocks 8231-8302: free
Blocks 8303-8305: used
Block 8306: free
Block 8307: used
Blocks 8308-8309: free
Blocks 8310-8312: used
Block 8313: free
Blocks 8314-8317: used
Block 8318: free
Block 8319: used

Had the above entry been from a bitmap block other than bitmap block 0, then (bitmap block # * block size * 8) needs to be added for the proper block number. By bitmap block # we understand the ordinal number (0 for the 1st, 1 for the second, ...) not the block number of the bitmap block.

Given a block number b, one can determine its status as follows:

b div (8 * block size) : bitmap block # (integer division)

Let r = b mod (8* block size), then

r div 8: byte within bitmap block, and
r mod 8: bit within byte
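
The lookup can be expressed directly in Python. This is a minimal sketch: locate_bit and is_block_used are names invented for this example, and the bitmap blocks are assumed to be available as a list indexed by their ordinal number.

```python
def locate_bit(block_number, block_size):
    """Return (bitmap block ordinal, byte within bitmap, bit within byte)."""
    bitmap_index, r = divmod(block_number, 8 * block_size)
    byte_index, bit_index = divmod(r, 8)
    return bitmap_index, byte_index, bit_index

def is_block_used(bitmap_blocks, block_number, block_size):
    """bitmap_blocks[i] holds the raw bytes of the i-th bitmap block."""
    bitmap_index, byte_index, bit_index = locate_bit(block_number, block_size)
    return bool((bitmap_blocks[bitmap_index][byte_index] >> bit_index) & 1)

# Rebuild the 16 example bytes at offset 0x400 of bitmap block 0;
# the rest of the block is left zeroed for the purpose of the example.
bitmap0 = bytearray(4096)
bitmap0[0x400:0x410] = bytes.fromhex("ff ff f7 ff 7f 00 00 00 00 00 00 00 00 80 cb bd")

print(locate_bit(8211, 4096))
print(is_block_used([bitmap0], 8211, 4096))
```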

The File System Tree

The Reiser file system is made up of a balanced tree (a B+ tree, or S+ tree as it is called in the reiserfs documentation). The tree is composed of internal nodes and leaf nodes. Each node is a disk block. Each object (called an item) in reiserfs (file, directory, or stat item) is assigned a unique key, which can be compared to an inode number in other file systems. The internal nodes are mainly composed of keys and pointers to their child nodes. There is always one more pointer than there are keys. P0 points to the objects that have keys smaller than K0, P1 to those with K0 <= key < K1, and so on.

For our example partition, part of the S+ tree looks like this (think of the key as a large 128-bit number for now):

The reiserfs S+-tree

Block headers

Each disk block that belongs to an internal or leaf node starts with a block header. Only unformatted blocks don't have a block header. A block header is always 24 bytes long and contains the following information:

The block header structure

Name Size Description
Level 2 level of the block in the tree
Nr. of items 2 number of items in the block
Free space 2 free space left in the block
Reserved 2  
Right key 16 right delimiting key for the block

The right delimiting key was originally used for leaf nodes but is now only kept for compatibility.

Example:

The following is the block header of block 8416, the leftmost leaf node in the tree.

00000000 01 00 06 00 e4 04 00 00 00 00 00 00 00 00 00 00  ....ä...........
00000010 00 00 00 00 00 00 00 00
Example of a block header

Level: 1
Items: 6
Free space: 1252 bytes
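
The block header can be decoded the same way. The sketch below assumes little-endian fields in the order given in the table; parse_block_header is a name made up for this example.

```python
import struct

BLOCK_HEADER_FORMAT = "<4H16s"   # level, items, free space, reserved, right key

def parse_block_header(raw):
    level, nr_items, free_space, _reserved, right_key = struct.unpack_from(
        BLOCK_HEADER_FORMAT, raw)
    return {"level": level, "nr_items": nr_items,
            "free_space": free_space, "right_key": right_key}

# Header of block 8416 from the example above.
header = parse_block_header(bytes.fromhex("01 00 06 00 e4 04 00 00") + bytes(16))
print(header["level"], header["nr_items"], header["free_space"])
```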

Keys

Keys are used in the Reiser file system to uniquely identify items, but also to locate them in the tree and achieve local groupings of items that belong together. A key consists of four objects: the directory id, the object id, the offset within the object, and a type. Note that the actual object identifier is only one part of the key. The directory id is present so that files that belong into the same directory are grouped together and for the most part are located in the same subtree(s). The offset is present because an indirect item can at most contain (blocksize-48)/4 pointers to unformatted blocks (see indirect items below). For a block size of 4096 bytes this would result in a maximum file size of 4048KB. To be able to handle larger files, multiple keys are used to reference the file. All fields of the key are the same, except for the offset, which denotes the offset in bytes of the file, which a particular key references. I do not know why the type of an object is part of the actual key.

In reiserfs up until version 3.5 the offset and the type fields were both 4 byte values. This meant, that the maximum file size was limited to roughly 2^32 bytes, or 4GB (2^32 bytes plus the data of one more indirect item plus the tail, actually). To increase the maximum file size in the file system, in version 3.6, the offset field was increased to 60 bits, and the type field shrunk to 4 bits. This now allows for a theoretical maximum file size of 2^60 bytes, but since there can be only 2^32 blocks with a maximum of 2^16 bytes per block, the file system itself only supports 2^48 bytes.

In order to stay compatible with older versions of the file system, there are now two different versions of keys around, which can be very confusing as the key itself doesn't carry a version number. To make up for this, the formerly reserved last 16 bits of the item header now contain a version number, so if necessary, the key's version number can be obtained from there. This makes it fairly straightforward for keys contained in leaf nodes, but if one really wanted to determine the version of a key inside an internal node, one would have to follow the tree down to the leaf first. The code in the reiserfs library actually uses this ugly hack to determine the key format:

static inline int is_key_format_1 (int type) {
    return ( (type == 0 || type == 15) ? 1 : 0);
}

/* old keys (on i386) have k_offset_v2.k_type == 15 (direct and
   indirect) or == 0 (dir items and stat data) */

/* */
int key_format (const struct key * key)
{
    int type;

    type = get_key_type_v2 (key);

    if (is_key_format_1 (type))
        return KEY_FORMAT_1;

    return KEY_FORMAT_2;
}

This actually implies that stat items will always be assumed to have KEY_FORMAT_1, because they also have a type of zero in version 2.

Key of version 1

Name Size Description
Directory ID 4 the identifier of the directory where the object is located
Object ID 4 the actual identifier of the object ("inode number")
Offset 4 the offset in bytes that this key references
Type 4 the type of item. Possible values are:
Stat: 0
Indirect: 0xfffffffe
Direct: 0xffffffff
Directory: 500
Any: 555

Key of version 2

Name Size Description
Directory ID 4 the identifier of the directory where the object is located
Object ID 4 the actual identifier of the object ("inode number")
Offset 60 bits the offset in bytes that this key references
Type 4 bits the type of item. Possible values are:
Stat: 0
Indirect: 1
Direct: 2
Directory: 3
Any: 15

Only stat items have an offset of 0. Files (direct and indirect items) and directories always start with an offset of 1 so that they are sorted behind the stat item in the leaf nodes. For directory items the "offset" field contains the hash value and generation number of the leftmost directory header of the directory item (see below), not the offset in bytes.

Examples:

The following shows the first two keys of the internal node that is contained in block 8482. The first one is of version 2, the second of version 1.

00000000 02 00 00 00 0e 00 00 00 00 00 00 00 00 00 00 00  ................  
Example of a key of version 2

Directory id: 2
Object id: 14
Offset: 0
Type: Stat item (0)

00000000 03 00 00 00 04 00 00 00 01 00 00 00 f4 01 00 00  ............ô...  
Example of a key of version 1

Directory id: 3
Object id: 4
Offset: 1
Type: Directory item (500)
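
Both example keys can be decoded with a short Python sketch. The function names are invented for this example. For version 2 the sketch assumes that, on a little-endian system, the 60-bit offset occupies the low bits of the last eight bytes and the 4-bit type the high bits; this matches the comment in the library code above (an old direct key with type 0xffffffff reads as type 15), but it is an assumption and not taken from a specification.

```python
import struct

def parse_key_v1(raw):
    """Version 1: four 32-bit little-endian fields."""
    return struct.unpack_from("<4I", raw)

def parse_key_v2(raw):
    """Version 2: two 32-bit ids, then offset (60 bits) and type (4 bits)."""
    directory_id, object_id, packed = struct.unpack_from("<2IQ", raw)
    offset = packed & ((1 << 60) - 1)   # assumed: low 60 bits
    item_type = packed >> 60            # assumed: high 4 bits
    return directory_id, object_id, offset, item_type

key_v2 = bytes.fromhex("02 00 00 00 0e 00 00 00 00 00 00 00 00 00 00 00")
key_v1 = bytes.fromhex("03 00 00 00 04 00 00 00 01 00 00 00 f4 01 00 00")
print(parse_key_v2(key_v2))
print(parse_key_v1(key_v1))
```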

Two keys are compared by comparing their directory ids first, and if those are equal, by comparing the object ids, and so on for offset and type. The fact that the Linux reiserfs code generates a warning when the type fields need to be compared for keys stored in memory indicates that the type field does not matter from a structural point of view. The only time the field needs to be compared seems to be during "tail conversion", where a direct item is changed into an indirect one.

Internal nodes

An internal node block consists of the block header, keys, and pointers to child nodes. Unlike in the figure of the S+-tree above, an internal node stores all of its keys first, sorted by key value. The pointers follow the last key, starting with the pointer to the subtree containing all the keys smaller than the first key.

Internal node layout

The level in the block header should always be larger than 1 for internal nodes. The number of items in the block header denotes the number of keys in the node, not the combined number of keys and pointers. There is always one more pointer than there are keys. The following figure describes the layout of the pointer structure:

Pointer to child node

Given a key n (whose position in the block is 24 + n * 16 bytes) and a total number of k keys in the block, the left pointer that corresponds to key n can be found at byte 24 + k * 16 + n * 8. The free space starts at byte blocksize - free space, where free space is the value from the block header.

Example:

00000000 02 00 a0 00 e0 00 00 00 00 00 00 00 00 00 00 00  .. .à...........
00000010 00 00 00 00 00 00 00 00 02 00 00 00 0e 00 00 00  ................
00000020 00 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00  ................
00000030 01 00 00 00 f4 01 00 00 03 00 00 00 9e 04 00 00  ....ô...........
00000040 00 00 00 00 00 00 00 00 04 00 00 00 05 00 00 00  ................
...
00000a10 01 10 00 00 00 00 00 20 e0 20 00 00 04 0b b4 cc  ....... à ....´Ì
00000a20 03 21 00 00 94 0d 54 c5 0b 21 00 00 e0 0f 2f c5  .!....TÅ.!..à./Å
00000a30 5e 23 00 00 b4 0f f4 ff 60 23 00 00 38 07 a9 ff  ^#..´.ôÿ`#..8.©ÿ
...

Level: 2
Nr. items: 160
Free space: 224 bytes

Key 0: {2, 14, 0, 0}
Key 1: {3, 4, 1, 500}
Key 2: {3, 1182, 0, 0}
...
Ptr 0: {8416, 2820}
Ptr 1: {8451, 3476}
Ptr 2: {8459, 4064}
Ptr 3: {9054, 4020}
...

This example shows parts of block 8482, which is also depicted in the diagram describing the S+-tree above. Key 0 starts at byte 24 (0x18), and since there are 160 items in the block, Ptr 0 starts at byte 2584 (0xa18). Note that the reserved parts of the pointers actually contain junk data. The free space starts at byte 3872 (0xf20) and it may also contain junk data.
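
The offset arithmetic can be checked with a few lines of Python. The function names are invented for this example. The pointer layout used in parse_pointer (a 4-byte block number, a 2-byte size, and 2 reserved bytes) is inferred from the example values, since the pointer figure is not reproduced here, and should be treated as an assumption.

```python
import struct

BLOCK_HEADER_SIZE = 24
KEY_SIZE = 16
POINTER_SIZE = 8

def key_offset(n):
    """Byte offset of key n within an internal node."""
    return BLOCK_HEADER_SIZE + n * KEY_SIZE

def pointer_offset(n, num_keys):
    """Byte offset of the left pointer that corresponds to key n."""
    return BLOCK_HEADER_SIZE + num_keys * KEY_SIZE + n * POINTER_SIZE

def free_space_start(block_size, free_space):
    return block_size - free_space

def parse_pointer(raw):
    """Assumed layout: block number (4), size (2), reserved (2)."""
    block_number, size, _reserved = struct.unpack_from("<IHH", raw)
    return block_number, size

# Block 8482 has 160 keys, so Ptr 0 starts at byte 2584 (0xa18).
print(hex(pointer_offset(0, 160)))
print(parse_pointer(bytes.fromhex("e0 20 00 00 04 0b b4 cc")))
```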

Leaf nodes

Leaf nodes are found at the lowest level of the S+-tree. Except for indirect items all the data is contained within the leaf nodes. Leaf nodes are made up of the block header, item headers, and items:

Leaf node layout

Note that the free space in the block is located between the last item header and item, and that items are in reverse order. This way, new item headers and items can simply be added without having to rearrange existing items. New headers go after the last header, and new items before the first on-disk item. Also note that items are of variable length.

Item Headers

The item header describes the item it refers to. It contains the key for the item as well as the item's location and size within the leaf node. The type of the item is determined by its key.

Item header layout

Name Size Description
Key 16 The key that belongs to the item
Count 2 The free space in the last unformatted node for an indirect item if this is an indirect item
0xffff for stat and direct items
the number of directory entries for a directory item
Length 2 total size of the item
Location 2 offset to the item body within the block
Version 2 0 for all old items (keys), 1 for new ones
Note that the comments in the structure definition indicate that new items have a version of 2. However, the KEY_FORMAT_3_6 constant is defined as 1 and this is used to set the version.

Example:

The following is the item header for the stat item described by key {2, 14, 0, 0}, which was used earlier as an example of type 2 (version 3.6). It shows that the version is indeed the new version, even though the heuristic above would indicate an old key.

00000000 02 00 00 00 0e 00 00 00 00 00 00 00 00 00 00 00  ................
00000010 ff ff 2c 00 d4 0f 01 00                          ÿÿ,.Ô...

Example of an item header

Key: {2, 14, 0, 0}
Count: 0xffff
Length: 44 bytes
Location: byte 4052
Version: 1 (3.6)
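
The item header example can be decoded as follows. This sketch assumes little-endian fields in the order of the layout table; parse_item_header is a name made up for this example and the key is left as raw bytes.

```python
import struct

ITEM_HEADER_FORMAT = "<16s4H"   # key, count, length, location, version

def parse_item_header(raw):
    key, count, length, location, version = struct.unpack_from(
        ITEM_HEADER_FORMAT, raw)
    return {"key": key, "count": count, "length": length,
            "location": location, "version": version}

item_header = parse_item_header(bytes.fromhex("""
    02 00 00 00 0e 00 00 00 00 00 00 00 00 00 00 00
    ff ff 2c 00 d4 0f 01 00
"""))
print(hex(item_header["count"]), item_header["length"],
      item_header["location"], item_header["version"])
```

The item body of 44 bytes at location 4052 ends exactly at byte 4096, the end of the block, which fits the description of items being stored from the end of the leaf node backwards.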

Items

Items finally contain actual data. There are four types of items: stat items, directory items, direct items, and indirect items. Files are made up of one or more direct or indirect item, depending on the file's size. Every file and directory is preceded by a stat item.
Stat Items
Stat items contain the meta-data for files and directories. Keys belonging to stat items always have an offset and type of 0, so that the stat item key always comes before the other one(s) belonging to the same "inode number". For the same reason that there are two versions of keys, there are also two versions of stat items: the size field was increased from 32 bits to 64 bits. For some reason, the fields for the number of hard links, user id, and group id were also each increased from 16 bits to 32 bits, and other fields were introduced. Thus a stat item of version 3.5 is 32 bytes in size, whereas one of version 3.6 has 44 bytes.

The structure of a stat item of version 1:

Structure of the stat item version 1

Name             Size  Description
Mode             2     file type and permissions
Num links        2     number of hard links
UID              2     user id
GID              2     group id
Size             4     file size in bytes
Atime            4     time of last access
Mtime            4     time of last modification
Ctime            4     time stat data was last changed
Rdev/blocks      4     device number / number of blocks file uses
First dir. byte  4     first byte of file which is stored in a direct item
                       if it equals 1 it is a symlink
                       if it equals 0xffffffff there is no direct item

The structure of a stat item of version 2:

Structure of the stat item version 2

Name            Size  Description
Mode            2     file type and permissions
Reserved        2
Num links       4     number of hard links
Size            8     file size in bytes
UID             4     user id
GID             4     group id
Atime           4     time of last access
Mtime           4     time of last modification
Ctime           4     time stat data was last changed
Blocks          4     number of blocks file uses
Rdev/gen/first  4     device number /
                      file's generation /
                      first byte of file which is stored in a direct item
                      if it equals 1 it is a symlink
                      if it equals 0xffffffff there is no direct item

The file mode field identifies the type of the file as well as the permissions. The low 9 bits (3 octal digits) contain the permissions for world, group, and user. The next 3 bits (from lower to higher) are the sticky bit, the set GID bit, and the set UID bit. The high 4 bits contain the file type. On a Linux system, possible values for the file type are (as defined in stat.h):

Constant Name  16-bit Mask  4-bit value  Description
S_IFSOCK       0xc000       12           socket
S_IFLNK        0xa000       10           symbolic link
S_IFREG        0x8000       8            regular file
S_IFBLK        0x6000       6            block device
S_IFDIR        0x4000       4            directory
S_IFCHR        0x2000       2            character device
S_IFIFO        0x1000       1            fifo

Other operating systems might have additional file types. Only regular files and directories have other items associated with the stat item. In all the other cases the stat item makes up the entire file.
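
The bit layout of the mode field can be checked with a few shifts and masks. The sketch below uses Python's standard stat module for the symbolic form; split_mode is a hypothetical helper name, and the two sample values (0x43ff and 0x81a4) are the mode words that appear in the hex dumps of this document.

```python
import stat

def split_mode(mode: int) -> dict:
    """Split a 16-bit mode word into file type, special bits and permissions."""
    return {
        "type": mode >> 12,              # high 4 bits: file type
        "special": (mode >> 9) & 0o7,    # bit 0 sticky, bit 1 set GID, bit 2 set UID
        "permissions": mode & 0o777,     # low 9 bits: user, group, world
        "symbolic": stat.filemode(mode), # e.g. "drwxr-xr-x"
    }

for mode in (0x43ff, 0x81a4):
    parts = split_mode(mode)
    print(hex(mode), parts["type"], oct(parts["permissions"]), parts["symbolic"])
```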

The "rdev" field applies to special files that are not regular files (S_IFREG), directories (S_IFDIR), or links (S_IFLNK). In those cases, the field holds the device number (or socket number) belonging to the file. The "generation" field applies to the other cases and denotes the inode generation number for the file/directory/link (see above for the description of the superblock's inode generation field). The "first" field doesn't seem to be used in version 2 anymore.

Example:

The following example shows the stat item denoted by key {2, 14, 0, 0} from the item header example above:

00000000 ff 43 05 00 03 00 00 00 50 00 00 00 00 00 00 00  ÿC......P.......
00000010 00 00 00 00 00 00 00 00 2d 1c 17 3f 34 94 ff 3e  ........-..?4.ÿ>
00000020 34 94 ff 3e 01 00 00 00 00 00 00 00              4.ÿ>........

Example of a stat item version 2

Mode: 0x43ff -- type: directory, sticky bit set, 777 permissions
Reserved: 5
Num. links: 3
Size: 80 bytes
UID: 0
GID: 0
Atime: Thu Jul 17 16:59:09 2003
Mtime: Sun Jun 29 20:36:52 2003
Ctime: Sun Jun 29 20:36:52 2003
Blocks: 1
First: 0
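
As a cross-check of the values listed above, a version 2 stat item can be unpacked in one call. This is a minimal sketch assuming little-endian fields in the order of the version 2 table; STAT_V2, FIELDS and parse_stat_v2 are names invented for the example. The time stamps are left as raw seconds since the epoch, because their printed form depends on the local time zone.

```python
import stat
import struct

# Version 2 stat item: 44 bytes, little-endian, no padding.
STAT_V2 = struct.Struct("<HHIQIIIIIII")
FIELDS = ("mode", "reserved", "num_links", "size", "uid", "gid",
          "atime", "mtime", "ctime", "blocks", "rdev_gen_first")

def parse_stat_v2(raw: bytes) -> dict:
    """Decode a 44-byte version 2 stat item into a field dictionary."""
    return dict(zip(FIELDS, STAT_V2.unpack(raw)))

# The 44 bytes from the hex dump of the example above.
raw = bytes.fromhex(
    "ff43 0500 03000000 5000000000000000"
    "00000000 00000000 2d1c173f 3494ff3e"
    "3494ff3e 01000000 00000000"
)
item = parse_stat_v2(raw)
print(hex(item["mode"]), stat.S_ISDIR(item["mode"]),
      item["num_links"], item["size"], item["blocks"])
```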

Directory Items
Directory items describe a directory. If there are too many entries in a directory to be contained in one directory item, it will span across several directory items, using the offset value of the key. Directory items are made up of directory headers and file names. Just like leaf nodes, the free space (if there is any) is located in the middle of the item. The structure of a directory item is as follows:

Structure of a directory item

Directory headers contain an offset, the first two parts of the referenced item's key (directory id and object id), the location of the name within the block, and a status field.

Structure of a directory header

Name       Size  Description
Offset     4     Hash value and generation number
Dir ID     4     object id of the referenced item's parent directory
Object ID  4     object id of the referenced item
Location   2     offset of name within the item
State      2     bit 0 indicates that item contains stat data (not used)
                 bit 2 whether entry is visible (bit set) or hidden

The file names are simple zero-terminated ASCII strings. File name entries seem to be 8-byte aligned, but the information in the directory headers should be the authoritative source for the start of the name (and implicitly the end, by looking at the previous header entry). The "offset" field is misnamed, as it actually contains a hash value of the file name. Bits 7 through 30 of the field contain the actual hash value and bits 0 through 6 a generation number in case two file names within a directory hash to the same value. Bit 31 seems to be unused. The hash value is used to actually search for file and directory names in reiserfs, and the directory items are sorted by the offset value. Three different hash functions are possible: keyed tea hash, rupasov hash, and r5 hash. The purpose of the hash function is to create different values for different strings with as few collisions as possible. In the Linux implementation of reiserfs, the r5 hash seems to be the default.

Example:

The following example is an entire directory item, that belongs to the stat item example from the previous section:

00000000 01 00 00 00 02 00 00 00 0e 00 00 00 48 00 04 00  ............H...
00000010 02 00 00 00 01 00 00 00 02 00 00 00 40 00 04 00  ............@...
00000020 00 6d 6f 73 0e 00 00 00 60 00 00 00 30 00 04 00  .mos....`...0...
00000030 76 69 2e 72 65 63 6f 76 65 72 00 00 00 00 00 00  vi.recover......
00000040 2e 2e 00 00 00 00 00 00 2e 00 00 00 00 00 00 00  ................

Example of a directory item

Header 0: {hash 0, gen. 1, 2, 14, byte 0x48, 4 (bit 2 set: visible)}
Header 1: {hash 0, gen. 2, 1, 2, byte 0x40, 4 (bit 2 set: visible)}
Header 2: {hash 15130330, gen. 0, 14, 96, byte 0x30, 4 (bit 2 set: visible)}
Name 2: "vi.recover"
Name 1: ".."
Name 0: "."

As one can see, the directory referenced by key {2, 14, 0, 0} consists of 3 entries, which in turn have the following keys (all these keys will lead to the stat item for the directory first):

. {2, 14, 0, 0}
.. {1, 2, 0, 0}
vi.recover {14, 96, 0, 0}
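
The directory item above can be walked in code as follows. This is only a sketch under the assumptions stated in the text: headers are 16 bytes each and little-endian, the hash sits in bits 7 through 30 of the offset field, and a name ends where the previous entry's name starts (or at the end of the item for entry 0). The entry count is taken as given here; per the item header table it would come from the Count field of the item header. parse_directory_item is a hypothetical helper name.

```python
import struct

DIR_HEADER = struct.Struct("<IIIHH")  # offset, dir id, object id, location, state

def parse_directory_item(item: bytes, entry_count: int) -> list:
    """Return one dictionary per directory entry, in header order."""
    entries = []
    name_end = len(item)  # name 0 runs to the end of the item
    for index in range(entry_count):
        offset, dir_id, object_id, location, state = DIR_HEADER.unpack_from(
            item, index * DIR_HEADER.size)
        # Names are zero-terminated; cut at the first NUL inside the slot.
        name = item[location:name_end].split(b"\0", 1)[0].decode("ascii")
        entries.append({
            "name": name,
            "hash": (offset >> 7) & 0xFFFFFF,  # bits 7 through 30
            "generation": offset & 0x7F,       # bits 0 through 6
            "key": (dir_id, object_id, 0, 0),  # key of the entry's stat item
            "visible": bool(state & 0x4),      # bit 2 of the state field
        })
        name_end = location  # the next name ends where this one starts
    return entries

# The 80 bytes from the hex dump of the example above.
item = bytes.fromhex(
    "01000000 02000000 0e000000 4800 0400"
    "02000000 01000000 02000000 4000 0400"
    "006d6f73 0e000000 60000000 3000 0400"
    "76692e7265636f766572000000000000"
    "2e2e000000000000 2e00000000000000"
)
for entry in parse_directory_item(item, 3):
    print(entry)
```
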
Direct Items
Direct items contain the entire file body of small files or the tail of a file. For small files, all the necessary other information can be found in the item header and the corresponding stat item for the file. For the tail of a file, the key for the direct item is the last one for the file.
Indirect Items
Indirect items contain pointers to unformatted blocks that belong to a file. Each pointer is 4 bytes long and contains the block number of the unformatted block. An indirect item that takes up an entire leaf node can contain at most (blocksize-48) / 4 pointers (the 48 bytes are for the block and item headers). In a partition with a block size of 4096 bytes, a single indirect item can reference at most 4145152 bytes (4048 KB: 1012 pointers to 4K blocks). Larger files are composed of multiple indirect items, using the offset value in the key, plus a possible tail.

Structure of an indirect item
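
The capacity figures for indirect items follow directly from the block size. A small sketch of the arithmetic, with indirect_item_capacity as an illustrative name and the 48 header bytes and 4-byte pointers taken from the text:

```python
def indirect_item_capacity(block_size: int, header_bytes: int = 48,
                           pointer_size: int = 4) -> tuple:
    """Return (pointers per full-leaf indirect item, bytes of file data referenced)."""
    pointers = (block_size - header_bytes) // pointer_size
    return pointers, pointers * block_size

pointers, data_bytes = indirect_item_capacity(4096)
print(pointers, data_bytes, data_bytes // 1024)
```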

The Journal

The journal in reiserfs is a continuous set of disk blocks and it describes transactions made to the file system. Each time the file system is modified in any way, instead of performing the changes directly in the file system, the transactions that belong together (those that need to be atomic so that the file system is in a consistent state) are written into the journal first. At a later point the transactions in the journal will be flushed and, if everything was successful, marked as such.

The journal is of fixed size in the file system. In the 2.4.x Linux implementation the journal size is fixed at 8192 blocks plus one block for the journal header. The journal itself consists of variable-length transactions and a journal header. The journal starts with the list of transactions and the journal header is at the end of the journal. A transaction spans at least three disk blocks and the journal header is exactly one block. The journal is a circular buffer, meaning that once the last block of the journal is reached, it wraps around and uses the first block again.

It is often stated that reiserfs records only the file system meta data in its journal. This is not entirely correct. It is true that the purpose of journaling is to ensure the integrity of the meta data. However, reiserfs journals entire disk blocks as they have to appear in the file system after the journal transaction is committed. Since directories, stat data and small files are stored directly in the leaf nodes of the tree, some amount of data is also contained in the journal and could be used to reconstruct earlier versions of a file or directory.

The journal layout

Journal Header

The journal header is a single block that describes where the first unflushed transaction can be found in the journal. The journal header is the last block of the journal. In our example the journal's first transaction starts at block 18 and there are 8192 journal blocks. Therefore, the journal header is at block 8210. There are only 12 bytes of information in the journal header; the rest of the block is undefined.

The journal header

Name              Size  Description
Last flush ID     4     The transaction ID of the last fully flushed transaction
Unflushed offset  4     The offset (in blocks) of the next transaction in the journal
Mount ID          4     The mount ID of the flushed transaction

The transaction pointed to by the offset must have a higher transaction ID or a higher mount ID than the flushed transaction in order to be considered an unflushed transaction. If this is not the case, all transactions are considered flushed and the block pointed to by the offset is used to start recording new journal transactions.

Example:

00000000 e2 74 02 00 24 1c 00 00 1d 01 00 00 12 00 00 00  ât..$...........  

The journal header example

Last flush ID: 160994
Unflushed offset: 7204 blocks
Mount ID: 285

In this example, the first unflushed transaction can be found at block 7222 (since the journal starts at block 18). However, the block found there does not contain a transaction description (see below) and therefore there aren't any unflushed transactions for the partition.
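
The journal header and the "unflushed" rule described above can be sketched as follows. The helper names parse_journal_header and is_unflushed are invented for this illustration, and the rule is coded exactly as worded in the text (higher transaction ID or higher mount ID), not taken from the kernel source.

```python
import struct

JOURNAL_HEADER = struct.Struct("<III")  # last flush ID, unflushed offset, mount ID

def parse_journal_header(block: bytes, journal_start: int) -> dict:
    """Decode the first 12 bytes of the journal header block."""
    last_flush_id, unflushed_offset, mount_id = JOURNAL_HEADER.unpack_from(block)
    return {
        "last_flush_id": last_flush_id,
        "unflushed_offset": unflushed_offset,
        "mount_id": mount_id,
        # Absolute block number where the next transaction would start.
        "next_block": journal_start + unflushed_offset,
    }

def is_unflushed(header: dict, trans_id: int, trans_mount_id: int) -> bool:
    """Apply the rule from the text to a candidate transaction."""
    return (trans_id > header["last_flush_id"]
            or trans_mount_id > header["mount_id"])

# The 12 meaningful bytes of the example; the journal starts at block 18.
header = parse_journal_header(bytes.fromhex("e2740200 241c0000 1d010000"), 18)
print(header)
```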

Transactions

Transactions describe changes in the file system. Instead of directly modifying blocks in the file system tree, the new or changed blocks are first written into the journal and mapped to their real location in the file system.

A transaction consists of a transaction description block, a list of blocks, and a commit block at the end. All those blocks are contiguous within the journal.

The journal transaction layout

Description block

The description block contains the transaction and mount IDs, the number of blocks in the transaction, a magic number, and the first half of the block mappings.

The transaction description block

Name            Size             Description
Transaction ID  4                The transaction ID
Len             4                Length (in blocks) of the transaction
Mount ID        4                Mount ID of the transaction
Real blocks     Block size - 24  Mapping for blocks in transaction
Magic           12               Magic number. Should be "ReIsErLB"

The "Real blocks" field is theoretically dependent on the block size. The first 12 bytes of the block have the IDs and the length, and the last 12 bytes contain the magic string. Everything in between is used for the block mapping. However, in the Linux 2.4.x implementation, the struct for a description block defines

  __u32 j_realblock[JOURNAL_TRANS_HALF];  
where JOURNAL_TRANS_HALF is a constant set to 1018. This means that the blocksize has to be 4096 for journaling to work with reiserfs under Linux!

The actual block mapping is done as follows: the "Real blocks" field is treated as an array that holds, for each block in the transaction, the number of its actual block in the file system. If we number every four bytes in the field as r0 through rn, then block 0 of the transaction holds the contents that block number r0 must have after the journal is flushed, block 1 of the transaction corresponds to block r1, and so on. If the "Real blocks" field of the description block is not large enough, the field in the commit block is used in addition. This limits the maximum number of blocks in one transaction to 2*(blocksize-24)/4 (2036 for a block size of 4K), but the actual limit is set in the superblock.
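
The relation between the block size, the JOURNAL_TRANS_HALF constant, and the transaction limit can be checked with a little arithmetic. The function names below are illustrative only; the constant 1018 and the 24 bytes of fixed fields are the values quoted in the text.

```python
JOURNAL_TRANS_HALF = 1018  # constant quoted from the Linux 2.4.x struct

def real_block_slots(block_size: int) -> int:
    """4-byte mapping slots that fit in one description or commit block."""
    return (block_size - 24) // 4

def max_transaction_blocks(block_size: int) -> int:
    """Upper bound from the layout: description slots plus commit slots."""
    return 2 * real_block_slots(block_size)

for size in (1024, 2048, 4096, 8192):
    slots = real_block_slots(size)
    print(size, slots, max_transaction_blocks(size),
          slots == JOURNAL_TRANS_HALF)
```

Only a block size of 4096 yields exactly 1018 slots, which is the observation behind the remark about the fixed array size.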

Commit block

The commit block terminates a transaction. It contains a copy of the transaction ID and the transaction length. There is also a 16 byte field reserved for a digest value at the end of the block, but this is not used currently.

The transaction commit block

Name            Size             Description
Transaction ID  4                The transaction ID
Len             4                Length (in blocks) of the transaction
Real blocks     Block size - 24  Mapping for blocks in transaction
Digest          16               Digest of all blocks in transaction. Not used.

Example:

The following example describes an old transaction in our example partition. The transaction starts in block 7243 (the description block), spans 4 data blocks (7244-7247) and has its commit block at block number 7248. Only the description block is shown, as the other blocks are not relevant for the example.

00000000 1b 6e 02 00 04 00 00 00 1b 01 00 00 90 22 00 00  .n..........."..
00000010 07 f7 00 00 aa 22 00 00 10 00 00 00 00 00 00 00  .÷..ª"..........
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
...
00000ff0 00 00 00 00 52 65 49 73 45 72 4c 42 00 00 00 00  ....ReIsErLB....

A sample transaction description block

Transaction ID: 159259
Length: 4 blocks
Mount ID: 283
Real blocks[0]: 8848
Real blocks[1]: 63239
Real blocks[2]: 8874
Real blocks[3]: 16
Magic: ReIsErLB

This transaction therefore describes the following mapping: when the transaction is committed/flushed, block 7244 is written to block 8848, block 7245 to block 63239, block 7246 to block 8874, and block 7247 to block 16 (the superblock).
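
The mapping above can be reproduced with a short sketch. It assumes a 4096-byte block, little-endian fields, and the magic string in the first 8 of the last 12 bytes, as in the dump. The rows elided by "..." in the dump are filled with zeros here purely to build a test buffer; that padding is an assumption, not data from the partition. parse_description_block and flush_plan are hypothetical names.

```python
import struct

MAGIC = b"ReIsErLB"

def parse_description_block(block: bytes) -> dict:
    """Decode IDs, length and the real-block mapping of a description block."""
    trans_id, length, mount_id = struct.unpack_from("<III", block, 0)
    if block[-12:-4] != MAGIC:
        raise ValueError("not a transaction description block")
    slots = (len(block) - 24) // 4   # mapping slots in this block
    count = min(length, slots)       # any further mappings live in the commit block
    real_blocks = list(struct.unpack_from(f"<{count}I", block, 12))
    return {"trans_id": trans_id, "length": length,
            "mount_id": mount_id, "real_blocks": real_blocks}

def flush_plan(desc_block_number: int, desc: dict) -> list:
    """Pair each journal block after the description block with its target."""
    return [(desc_block_number + 1 + i, real)
            for i, real in enumerate(desc["real_blocks"])]

# First 28 bytes from the dump, zero padding (assumed), then the magic field.
head = bytes.fromhex("1b6e0200 04000000 1b010000 90220000 07f70000 aa220000 10000000")
block = head + bytes(4096 - len(head) - 12) + MAGIC + bytes(4)

desc = parse_description_block(block)
print(desc["trans_id"], desc["mount_id"], flush_plan(7243, desc))
```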

Navigating reiserfs

In addition to the file system tree itself, in order to access files, one needs to navigate through the directory tree, as well. The root directory of a Reiser file system always has the key {1, 2, 0, 0}. The keys for subsequent directories and files within the directory hierarchy can then be found in the headers of the directory items. Since the keys in reiserfs are sorted by parent directory ID first, items that are in the same directory are grouped together in the file system tree. This allows for searching for keys locally instead of always having to go through the root node of the file system.

A key {a, b, 0, 0} will always yield the stat item of the directory or file, and subsequent items will follow immediately after that in the file system tree. The stat item contains the size of the actual item in bytes. With this information and using the size information of the individual item headers, the keys for other parts of the directory/file can be constructed and the parts located. In many cases, the items will be arranged consecutively on the disk, anyway.

The following three examples will show three different types of files: a very small file consisting only of a stat item and a tail, a larger file that actually has an indirect item, and finally a very large file that spans over multiple indirect items. We again use the example partition from above, which is an image of a partition mounted as "/var" in a SuSe Linux 8.0 system.

Example 1: small file

The first example is that of a small file that consists only of a stat item and one direct item. The file is "/var/log/y2start.log-initial". The root directory ("/var") has key {1,2,0,0}, which by navigating the file system tree can be found in block 8416. There we can find that the "log" directory has key {2,13,0,0}. This directory is also contained in block 8416. The file "y2start.log-initial" has key {13, 1633, 0, 0}. By inspecting block 8482, we find that this key is contained in the leaf node block number 24224. The item headers for the keys {13, 1633, 0, 0} and {13, 1633, 1, 2} are as follows:
00000090 0d 00 00 00 61 06 00 00 00 00 00 00 00 00 00 00  ....a...........
000000a0 ff ff 2c 00 a4 0b 01 00 0d 00 00 00 61 06 00 00  ÿÿ,.¤.......a...
000000b0 01 00 00 00 00 00 00 20 ff ff f0 00 b4 0a 01 00  ....... ÿÿð.´...
Key: {13, 1633, 0, 0}
Count: 0xffff
Length: 44 bytes
Location: byte 2980 (0xba4)
Version: 1 (new)

Key: {13, 1633, 1, 2}
Count: 0xffff
Length: 240 bytes
Location: byte 2740 (0xab4)
Version: 1 (new)
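
The two keys above can be decoded from their raw 16 bytes. The sketch below assumes the version 3.6 key layout in which the last 8 bytes form one little-endian 64-bit value holding a 60-bit offset in the low bits and a 4-bit type in the high bits; this layout is an assumption of the example and agrees with the bytes shown in the dump. parse_key_v2 is a made-up helper name.

```python
import struct

def parse_key_v2(raw: bytes) -> tuple:
    """Return (dir id, object id, offset, type) of a 16-byte version 3.6 key."""
    dir_id, object_id, packed = struct.unpack("<IIQ", raw)
    offset = packed & ((1 << 60) - 1)  # low 60 bits
    item_type = packed >> 60           # high 4 bits
    return dir_id, object_id, offset, item_type

stat_key = parse_key_v2(bytes.fromhex("0d000000 61060000 0000000000000000"))
direct_key = parse_key_v2(bytes.fromhex("0d000000 61060000 0100000000000020"))
print(stat_key, direct_key)
```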

At byte 2740 (0xab4) in the block, we find the direct item for the file followed by the stat item at byte 2980 (0xba4):

00000ab0             65 6e 76 0a 65 63 68 6f 20 59 32 44      env.echo Y2D
00000ac0 45 42 55 47 20 28 29 0a 6d 65 6d 69 6e 66 6f 20  EBUG ().meminfo
00000ad0 31 20 3d 20 4d 65 6d 3a 20 31 30 33 33 34 35 36  1 = Mem: 1033456
00000ae0 20 38 35 39 37 36 20 39 34 37 34 38 30 20 30 20   85976 947480 0
00000af0 36 34 32 34 20 35 37 31 37 32 0a 69 53 65 72 69  6424 57172.iSeri
00000b00 65 73 3d 31 0a 68 76 63 5f 63 6f 6e 73 6f 6c 65  es=1.hvc_console
00000b10 3d 31 0a 58 31 31 69 3d 0a 4d 65 6d 54 6f 74 61  =1.X11i=.MemTota
00000b20 6c 3d 31 30 33 33 34 35 36 0a 66 62 64 65 76 5f  l=1033456.fbdev_
00000b30 6f 6b 3d 31 0a 75 70 64 61 74 65 3d 0a 58 56 65  ok=1.update=.XVe
00000b40 72 73 69 6f 6e 3d 34 0a 58 53 65 72 76 65 72 3d  rsion=4.XServer=
00000b50 66 62 64 65 76 0a 78 73 72 76 3d 58 46 72 65 65  fbdev.xsrv=XFree
00000b60 38 36 0a 73 63 72 65 65 6e 3d 66 62 64 65 76 0a  86.screen=fbdev.
00000b70 6d 65 6d 69 6e 66 6f 20 32 20 3d 20 4d 65 6d 3a  meminfo 2 = Mem:
00000b80 20 31 30 33 33 34 35 36 20 39 32 34 30 34 20 39   1033456 92404 9
00000b90 34 31 30 35 32 20 30 20 38 32 33 32 20 36 30 35  41052 0 8232 605
00000ba0 31 36 0a 00 a4 81 05 00 01 00 00 00 ef 00 00 00  16..¤.......ï...
00000bb0 00 00 00 00 00 00 00 00 00 00 00 00 25 15 3e 3d  ............%.>=
00000bc0 25 15 3e 3d 25 15 3e 3d 08 00 00 00 d5 02 00 00  %.>=%.>=....Õ...
Mode: S_IFREG (regular file), -rw-r--r--
Num. links: 1
Size: 239
UID: 0
GID: 0
A/M/Ctimes: 07/23/2002 21:47:01
Blocks: 8
Gen: 725

Note that the stat item contains the correct size for the file, 239 bytes. This means that byte 2979 (0xba3) of the block does not belong to the file anymore.

Example 2: file with indirect item

The file "/var/log/SaX.log" is 7121 bytes long. It therefore cannot fit as a direct item and needs to be split either into two unformatted blocks or one unformatted block and a tail. In this case, the file will take up two unformatted blocks described by one indirect item. The key for the file is {13, 1490, 0, 0} and examining block 8482 we find out that it is contained in leaf node block number 27444.
00000040                         0d 00 00 00 d2 05 00 00          ....Ò...
00000050 00 00 00 00 00 00 00 00 ff ff 2c 00 a4 0b 01 00  ........ÿÿ,.¤...
00000060 0d 00 00 00 d2 05 00 00 01 00 00 00 00 00 00 10  ....Ò...........
00000070 00 00 08 00 9c 0b 01 00                          ........
Key: {13, 1490, 0, 0}
Count: 0xffff
Length: 44 bytes
Location: byte 2980 (0xba4)
Version: 1 (new)

Key: {13, 1490, 1, 1}
Count: 0
Length: 8 bytes
Location: byte 2972 (0xb9c)
Version: 1 (new)

00000b90                                     12 52 00 00              .R..
00000ba0 13 52 00 00 a4 81 05 00 01 00 00 00 d1 1b 00 00  .R..¤.......Ñ...
00000bb0 00 00 00 00 00 00 00 00 00 00 00 00 3f aa 4a 3d  ............?ªJ=
00000bc0 bd aa 4a 3d bd aa 4a 3d 10 00 00 00 54 05 00 00  ½ªJ=½ªJ=....T...
Mode: S_IFREG (regular file), -rw-r--r--
Num. links: 1
Size: 7121
UID: 0
GID: 0
Atime: Fri Aug 2 10:50:23 2002
M/Ctimes: Fri Aug 2 10:52:29 2002
Blocks: 16
Gen: 1364
Block 1: 21010
Block 2: 21011

The file is thus made up of the contents of blocks 21010 and 21011. Block 21010 contains a full 4096 bytes of data, whereas block 21011 contains only 3025 bytes. For some reason, though, the item header for the indirect item (see above) doesn't contain a count of 1071 bytes, as one would have expected.
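
The byte counts in this example come from a simple division of the file size by the block size. A minimal sketch, with unformatted_block_usage as an illustrative name:

```python
def unformatted_block_usage(file_size: int, block_size: int = 4096) -> dict:
    """Blocks needed, bytes used in the last block, and bytes left free in it."""
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = full_blocks + (1 if remainder else 0)
    used_in_last = remainder if remainder else (block_size if full_blocks else 0)
    return {"blocks": blocks,
            "used_in_last": used_in_last,
            "free_in_last": blocks * block_size - file_size}

print(unformatted_block_usage(7121))
```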

Example 3: a large file

The file "/var/lib/rpm/fileindex.rpm" is a file of over 11 MB in size. A single indirect item cannot describe the file, as there isn't enough space in a block for such a large indirect item. The file has the key {4, 7, 0, 0}, which can be found in block 16822. This block, however, contains only the stat item for the file. The indirect items for the file span over three more blocks: Key {4, 7, 1, 1} is in block 13286, key {4, 7, 4145153, 1} in block 20171, and key {4, 7, 8290305, 1} in block 20987. Block 13286 contains one single indirect item:
00000010                         04 00 00 00 07 00 00 00          ........
00000020 01 00 00 00 00 00 00 10 00 00 d0 0f 30 00 01 00  ..........Ð.0...
Key: {4, 7, 1, 1}
Count: 0
Length: 4048 bytes
Location: byte 48 (0x30)
Version: 1 (new)

What follows are 1012 pointers to unformatted blocks. Block 20171 has the same structure. Block 20987 also holds just one indirect item, but uses only 3320 bytes for 830 pointers. Note how the offset for the next key derives directly from the offset of the previous key and the number of pointers in the previous indirect item:

1 + (1012 pointers * 4096 bytes blocksize) = 4145153
4145153 + (1012 pointers * 4096 bytes blocksize) = 8290305
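
The same rule can be written as a short loop over the pointer counts of the indirect items. The function name indirect_key_offsets is illustrative; the counts 1012, 1012 and 830 are the ones given for the three blocks above.

```python
def indirect_key_offsets(pointer_counts, block_size: int = 4096) -> list:
    """Key offset of each indirect item; offsets are 1-based byte positions."""
    offsets = []
    offset = 1
    for count in pointer_counts:
        offsets.append(offset)
        offset += count * block_size  # bytes covered by this indirect item
    return offsets

print(indirect_key_offsets([1012, 1012, 830]))
```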

 

张志强
2007-03-29

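New File System Features in NTFS 5.0

1. NTFS 5.0 supports partitions (called volumes when dynamic disks are used) of up to 2 TB. FAT32 in Windows 2000, by contrast, supports partitions of at most 32 GB.

2. NTFS 5.0 is a recoverable file system. Users rarely need to run a disk repair program on an NTFS 5.0 partition. NTFS keeps the partition consistent by using standard transaction logging and recovery techniques. When a system failure occurs, NTFS uses the log file and checkpoint information to restore the consistency of the file system automatically.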

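3. NTFS 5.0 supports compression of partitions, folders, and files. Any Windows-based application can read and write compressed files on an NTFS partition without another program decompressing them first: a file is decompressed automatically when it is read and compressed again automatically when it is closed or saved.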

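4. NTFS 5.0 uses smaller clusters and can therefore manage disk space more efficiently. With FAT32 under Windows 2000, the cluster size is 4 KB for partitions of 2 GB to 8 GB, 8 KB for partitions of 8 GB to 16 GB, and 16 KB for partitions of 16 GB to 32 GB. With NTFS under Windows 2000, the cluster size is smaller than the corresponding FAT32 cluster size for partitions under 2 GB, and is always 4 KB for partitions of 2 GB and above (2 GB to 2 TB). NTFS therefore manages disk space more efficiently than FAT32 and keeps wasted disk space to a minimum.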

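5. On an NTFS 5.0 partition, access permissions can be set for shared resources, folders, and files. Permission settings cover two things: which groups or users may access a folder, file, or shared resource, and what level of access those groups or users are granted. Access permissions apply not only to users of the local computer but also to network users who access files through shared folders. In addition, under Windows 2000 with NTFS, an audit policy can be applied to folders, files, and Active Directory objects. The audit results are recorded in the security log, which shows which groups or users performed what level of operation on a folder, file, or Active Directory object. This makes it possible to detect unauthorized access and to take measures that keep such security risks to a minimum. It is even possible to encrypt each individual file. Some may argue that setting user permissions in NT 4.0 already achieves this. The Encrypting File System of NTFS 5.0 is not actually a separate file system but a new feature of NTFS. It encrypts a file with a randomly generated key; only the file's owner and the administrator hold the key needed to decrypt it, so other people cannot read the file even if they can log on to the system. In NT 4.0, however, the file itself is not encrypted: a user who wants to read a file he has no permission to access only needs to install another copy of NT on the hard disk. Under NTFS 5.0, because the file is stored encrypted, a user who installs another copy of Windows 2000 still cannot obtain the decryption key, so the Encrypting File System is more secure.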

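6. The NTFS file system in Windows 2000 supports disk quota management. With disk quotas, an administrator can limit the disk space available to each user, so that every user can use disk space only up to the assigned maximum. Once quotas are set, each user's disk usage can be tracked and controlled, and monitoring identifies users who exceed the quota warning threshold or the quota limit so that appropriate action can be taken. Disk quota management lets administrators allocate storage resources conveniently and sensibly, prevents system crashes that uncontrolled disk usage might cause, and improves system security.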

7. NTFS 5.0 uses a "change" journal to track and record the changes made to files.

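8. NTFS 5.0 supports dynamic partitions: the size of a partition can be changed online, without logging off, formatting, or rebooting. In addition, if a partition contains important files, a mirror of that partition can be created dynamically. During this process users can continue to read and write files on the partition as usual and will not notice anything unusual. When the mirror is no longer needed, it can be removed online as well.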

 


张志强
2007-03-29