Analysis of the Linux inode cache mechanism

probing · posted 2003-2-16 23:03:44

Posted by: 阿水   Time: 2003-01-27 09:45:06

Author: 詹荣开

(詹荣开 zhanrk@sohu.com)

The Linux inode cache mechanism is implemented in fs/inode.c. The code quoted below is from the Linux 2.4 kernel.

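1.1 The slab cache for inode objects
The inode cache (icache for short) is built on top of a slab allocator cache for inode objects. Every inode object is therefore allocated from, and returned to, physical memory through kmem_cache_alloc() and kmem_cache_free().
The slab cache for inode objects is referenced by inode_cachep, a pointer of type kmem_cache_t. It is created with kmem_cache_create() in inode_init(), the icache initialization function.
inode.c also defines two wrappers around it. One allocates an inode object from the inode_cachep slab cache. The other returns an inode object that is no longer needed to the slab allocator: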
#define alloc_inode() \
    ((struct inode *) kmem_cache_alloc(inode_cachep, SLAB_KERNEL))
static void destroy_inode(struct inode *inode)
{
    if (!list_empty(&inode->i_dirty_buffers))
        BUG();
    kmem_cache_free(inode_cachep, (inode));
}
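The value of allocating through a dedicated cache is that freed objects can be handed out again cheaply. The sketch below is a hypothetical toy fixed-size object cache that keeps freed objects on a free list. It is not the kernel slab allocator: it has no slabs, no constructors and no per-CPU state, and all cache_* names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy object cache (NOT the kernel slab allocator).
 * Freed objects go onto a free list and are reused by the next
 * allocation instead of being returned to malloc()/free(). */
struct obj_cache {
    size_t obj_size;      /* size of each object, at least sizeof(void *) */
    void *free_list;      /* singly linked list threaded through free objects */
    size_t nr_allocated;  /* objects currently handed out */
};

static void cache_init(struct obj_cache *c, size_t obj_size)
{
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
    c->nr_allocated = 0;
}

static void *cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (obj)
        c->free_list = *(void **)obj;   /* pop a cached object */
    else
        obj = malloc(c->obj_size);      /* cache empty: get fresh memory */
    if (obj)
        c->nr_allocated++;
    return obj;
}

static void cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_list;       /* push onto the free list */
    c->free_list = obj;
    c->nr_allocated--;
}

static void cache_destroy(struct obj_cache *c)
{
    while (c->free_list) {              /* release every cached object */
        void *next = *(void **)c->free_list;
        free(c->free_list);
        c->free_list = next;
    }
}
```

An object passed to cache_free() comes straight back from the next cache_alloc(). This mirrors the pairing of alloc_inode() and destroy_inode() shown above.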

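1.2 Low-level operations on inode objects
inode.c implements several basic low-level operations on inode objects:
(1) clean_inode(): initialize part of an inode object's fields
This function sets certain fields of an inode object freshly allocated from the inode_cachep slab cache to known values, usually 0. There is one exception: the link count i_nlink is initialized to 1. It is a static function internal to inode.c, so it can only be called by other functions in that file, such as get_empty_inode() and get_new_inode().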
/*
* This just initializes the inode fields
* to known values before returning the inode..
*
* i_sb, i_ino, i_count, i_state and the lists have
* been initialized elsewhere..
*/
static void clean_inode(struct inode *inode)
{
    static struct address_space_operations empty_aops;
    static struct inode_operations empty_iops;
    static struct file_operations empty_fops;
    memset(&inode->u, 0, sizeof(inode->u));
    inode->i_sock = 0;
    inode->i_op = &empty_iops;
    inode->i_fop = &empty_fops;
    inode->i_nlink = 1; /* NOTE! i_nlink is initialized to 1 */
    atomic_set(&inode->i_writecount, 0);
    inode->i_size = 0;
    inode->i_generation = 0;
    memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
    inode->i_pipe = NULL;
    inode->i_bdev = NULL;
    inode->i_data.a_ops = &empty_aops;
    inode->i_data.host = inode;
    inode->i_mapping = &inode->i_data;
}
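(2) get_empty_inode(): allocate an empty inode object from the slab allocator
This function allocates an inode object with the alloc_inode() macro and initializes its fields to 0 or NULL, with three exceptions:
- the reference count i_count is set to 1;
- the link count i_nlink is set to 1, by clean_inode(), which performs part of the initialization;
- i_ino is set to a fresh value taken from a private static counter, last_ino.

It then links the inode onto the inode_in_use list and returns a pointer to it: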
struct inode * get_empty_inode(void)
{
    static unsigned long last_ino;
    struct inode * inode;

    inode = alloc_inode();
    if (inode)
    {
        spin_lock(&inode_lock);
        inodes_stat.nr_inodes++;
        list_add(&inode->i_list, &inode_in_use);
        inode->i_sb = NULL;
        inode->i_dev = 0;
        inode->i_ino = ++last_ino;
        inode->i_flags = 0;
        atomic_set(&inode->i_count, 1);
        inode->i_state = 0;
        spin_unlock(&inode_lock);
        clean_inode(inode);
    }
    return inode;
}
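Most kernel code does not obtain inode objects through this function. It is used by two kinds of caller:
- kernel modules that need an inode object without an inode number, the networking layer for example;
- filesystems that have no known information to fill in.

(3) clear_inode(): clear the contents of an inode object
clear_inode() is normally called before destroy_inode() frees an inode object. It tears down the object's state, for example by invalidating the buffers the inode refers to and by dropping its references to other objects.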
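The initialization pattern of get_empty_inode() can be shown in user space. The sketch below uses a hypothetical, heavily simplified struct toy_inode. For brevity it zeroes the whole object with memset(), whereas the kernel code sets specific fields. It then sets the few fields that must not be zero: a unique i_ino from a private static counter, a reference count of 1, and a link count of 1. The in-use list and locking are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct inode. */
struct toy_inode {
    unsigned long i_ino;
    void *i_sb;            /* NULL: not attached to any superblock */
    int i_count;           /* reference count */
    unsigned int i_nlink;  /* link count */
    unsigned long i_state;
    long long i_size;
};

static unsigned long toy_nr_inodes;   /* plays the role of inodes_stat.nr_inodes */

/* Sketch of the get_empty_inode() idea: known values first, then the
 * fields that must be non-zero. */
static struct toy_inode *toy_get_empty_inode(void)
{
    static unsigned long last_ino;    /* private counter, like last_ino */
    struct toy_inode *inode = malloc(sizeof(*inode));

    if (!inode)
        return NULL;
    memset(inode, 0, sizeof(*inode)); /* mostly 0 */
    inode->i_sb = NULL;               /* anonymous: no superblock */
    inode->i_ino = ++last_ino;        /* unique number, no on-disk inode */
    inode->i_count = 1;               /* the caller holds one reference */
    inode->i_nlink = 1;               /* as clean_inode() does */
    toy_nr_inodes++;
    return inode;
}
```

Two successive calls return inodes with consecutive i_ino values, both holding one reference.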
/**
* clear_inode - clear an inode
* @inode: inode to clear
*
* This is called by the filesystem to tell us
* that the inode is no longer useful. We just
* terminate it with extreme prejudice.
*/
void clear_inode(struct inode *inode)
{
    if (!list_empty(&inode->i_dirty_buffers))
        invalidate_inode_buffers(inode);

    if (inode->i_data.nrpages)
        BUG();
    if (!(inode->i_state & I_FREEING))
        BUG();
    if (inode->i_state & I_CLEAR)
        BUG();
    wait_on_inode(inode);
    if (IS_QUOTAINIT(inode))
        DQUOT_DROP(inode);
    if (inode->i_sb && inode->i_sb->s_op && inode->i_sb->s_op->clear_inode)
        inode->i_sb->s_op->clear_inode(inode);
    if (inode->i_bdev) {
        bdput(inode->i_bdev);
        inode->i_bdev = NULL;
    }
    inode->i_state = I_CLEAR;
}
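The BUG() checks in clear_inode() encode a small state machine. The caller must already have set I_FREEING and emptied the page cache, and an inode must not be cleared twice. On success every other state bit is replaced by I_CLEAR. The sketch below models only those checks. The T_* flag values are hypothetical, not the kernel's I_* constants, and it returns false where the kernel would call BUG().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real I_* constants are defined in the
 * kernel headers and may differ. */
#define T_DIRTY   0x1u
#define T_LOCK    0x2u
#define T_FREEING 0x4u
#define T_CLEAR   0x8u

/* Returns false where clear_inode() would hit BUG(). */
static bool toy_clear_inode(unsigned int *state, unsigned long nrpages)
{
    if (nrpages != 0)              /* page cache must already be empty */
        return false;
    if (!(*state & T_FREEING))     /* caller must have marked it FREEING */
        return false;
    if (*state & T_CLEAR)          /* clearing twice is a bug */
        return false;
    /* ... the kernel waits on I_LOCK, drops quota, calls s_op->clear_inode,
     * and releases the block device here ... */
    *state = T_CLEAR;              /* all other state bits are dropped */
    return true;
}
```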

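1.3 icache data structures
Linux implements the inode cache by maintaining several doubly linked lists on top of the inode_cachep slab cache, so that in-memory inode objects can be managed efficiently. These lists are the in-use inode list, the unused inode list, the inode hash table, and the hash chain for anonymous inodes. They are defined as follows: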
static LIST_HEAD(inode_in_use);
static LIST_HEAD(inode_unused);
static struct list_head *inode_hashtable;
static LIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */
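In addition, each super_block object has a doubly linked list, s_dirty, of inodes that have been modified (dirty inodes).
Every inode object sits on two separate doubly linked lists:
(1) The inode hash table, inode_hashtable, which speeds up inode lookups. Each inode is linked into a hash chain through its i_hash field.
(2) An inode "type" list:
• If i_count > 0, i_nlink > 0 and the inode is clean, it is linked through its i_list field onto the global inode_in_use list.
• If i_count and i_nlink are both greater than 0 but the inode is dirty (I_DIRTY is set in i_state), it is linked through i_list onto the s_dirty list of the super_block it belongs to.
• If i_count == 0, it is linked through i_list onto the inode_unused list. A dirty inode whose i_count drops to 0 stays on s_dirty instead, as the __iget() and iput() code below shows.
Anonymous inode objects that belong to no superblock (i_sb == NULL) are linked through i_hash onto the global anonymous hash chain anon_hash_chain.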

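1.3.1 Locking the inode cache lists
inode.c defines the spinlock inode_lock to serialize access to all of the inode cache lists. Any code that accesses any of these lists must hold the lock, taking it with spin_lock() and releasing it with spin_unlock() when done. inode_lock is defined as:
spinlock_t inode_lock = SPIN_LOCK_UNLOCKED;
NOTE! Changing the i_state field of an inode that is in use also requires holding this spinlock.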

1.3.2 inode cache statistics
The global variable inodes_stat holds the inode cache statistics, chiefly the total number of inode objects in the cache and how many of them are unused:
struct {
    int nr_inodes;
    int nr_unused;
    int dummy[5];
} inodes_stat;

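1.3.3 The inode hash table
The main purpose of the inode hash table is to speed up finding a particular inode object in the icache. inode_hashtable points to an array of hash chain heads. All inode objects with the same hash value h are linked through their i_hash fields into a doubly linked list hanging off the head inode_hashtable[h]. The heads are stored together as an array whose start address is held in inode_hashtable.
In Linux, the array of hash chain heads is stored in $2^{order}$ physically contiguous page frames, where $order \ge 0$. The value of order depends on num_physpages, the total number of physical page frames in the system. The number of hash chain heads is therefore $2^{order} \cdot \mathrm{PAGE\_SIZE} / \mathrm{sizeof(struct\ list\_head)}$. A list_head is 8 bytes (two 32-bit pointers, $2^3$ bytes), and with 4 KB pages ($\mathrm{PAGE\_SIZE} = 2^{12}$) the number of inode hash chain heads is $2^{order+12-3}$.
• Initializing the hash table
The inode cache is initialized by inode_init(), which does two things:
(1) it initializes the inode hash table, including allocating physical memory for the array of hash chain heads;
(2) it creates the inode slab cache.
Its source code is: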
/*
* Initialize the hash tables.
*/
void __init inode_init(unsigned long mempages)
{
    struct list_head *head;
    unsigned long order;
    unsigned int nr_hash;
    int i;

    /* Compute order. (Author's note: I don't know why it is computed this way :) ) */
    mempages >>= (14 - PAGE_SHIFT);
    mempages *= sizeof(struct list_head);
    for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
        ;

    do {
        unsigned long tmp;

        nr_hash = (1UL << order) * PAGE_SIZE /
            sizeof(struct list_head);
        i_hash_mask = (nr_hash - 1);

        tmp = nr_hash;
        i_hash_shift = 0;
        while ((tmp >>= 1UL) != 0UL)
            i_hash_shift++;

        inode_hashtable = (struct list_head *)
            __get_free_pages(GFP_ATOMIC, order);
    } while (inode_hashtable == NULL && --order >= 0);

    printk("Inode-cache hash table entries: %d (order: %ld, %ld bytes)\n",
            nr_hash, order, (PAGE_SIZE << order));

    if (!inode_hashtable)
        panic("Failed to allocate inode hash table\n");

    head = inode_hashtable;
    i = nr_hash;
    do {
        INIT_LIST_HEAD(head);
        head++;
        i--;
    } while (i);

    /* inode slab cache */
    inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
                     0, SLAB_HWCACHE_ALIGN, init_once,
                     NULL);
    if (!inode_cachep)
        panic("cannot create inode slab cache");
}
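Notes on the function:
(1) The function first computes how many contiguous physical page frames the hash head array needs, which is the largest order it will try.
(2) In the do…while loop it then tries to allocate $2^{order}$ contiguous physical page frames. If the allocation fails, it decrements order and tries again. If the allocation still fails at order = 0 (a single page), i.e. inode_hashtable == NULL, the loop exits and panic() halts kernel initialization. Strictly speaking, order is declared unsigned long, so the test --order >= 0 as written is always true and the loop does not really stop at order 0. In practice this early boot-time allocation is expected to succeed long before that matters.
(3) On each pass through the loop, before allocating memory, the function computes the hash bit mask (i_hash_mask) and the number of hash bits (i_hash_shift). The mask is i_hash_mask = nr_hash - 1, where nr_hash is the number of inode hash chain heads. i_hash_shift is $\log_2(nr\_hash)$, which equals order + 12 - 3.
i_hash_mask and i_hash_shift are defined as follows: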
#define I_HASHBITS i_hash_shift
#define I_HASHMASK i_hash_mask

static unsigned int i_hash_mask;
static unsigned int i_hash_shift;
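(4) Once the memory is allocated, the function initializes every head in the array with the INIT_LIST_HEAD macro, which points the head's prev and next pointers at the head itself.
(5) Finally, the function creates the inode slab cache inode_cachep by calling kmem_cache_create().
• Computing the hash value of an inode object
In Linux an inode object is uniquely identified by the pair (device number, inode number). Device numbers correspond one to one with superblock objects, so the hash function hash() takes a super_block pointer and the inode number i_ino as its arguments: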
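The sizing arithmetic in inode_init() can be reproduced without allocating anything. One reading of the computation the author found puzzling:
- mempages >>= (14 - PAGE_SHIFT) turns a page count into a count of 16 KB chunks of RAM;
- multiplying by sizeof(struct list_head) gives the bytes needed for one hash head per 16 KB;
- the loop then rounds that up to the smallest power-of-two number of pages.

The sketch below assumes 4 KB pages and an 8-byte list_head (32-bit pointers). Those are constants chosen here, not values read from a running kernel, and the names are hypothetical.

```c
#include <assert.h>

/* Assumptions: 4 KB pages and an 8-byte struct list_head (two 32-bit
 * pointers), as in the 32-bit x86 case discussed above. */
#define SK_PAGE_SHIFT     12
#define SK_PAGE_SIZE      (1UL << SK_PAGE_SHIFT)
#define SK_LIST_HEAD_SIZE 8UL

struct hash_geometry {
    unsigned long order;       /* table occupies 2^order pages */
    unsigned int nr_hash;      /* number of hash chain heads */
    unsigned int hash_mask;    /* nr_hash - 1 */
    unsigned int hash_shift;   /* log2(nr_hash) */
};

/* Same arithmetic as inode_init(), without the allocation. */
static struct hash_geometry inode_hash_geometry(unsigned long mempages)
{
    struct hash_geometry g;
    unsigned int tmp;

    mempages >>= (14 - SK_PAGE_SHIFT);  /* pages -> 16 KB chunks */
    mempages *= SK_LIST_HEAD_SIZE;      /* bytes for one head per chunk */
    for (g.order = 0; ((1UL << g.order) << SK_PAGE_SHIFT) < mempages; g.order++)
        ;                               /* smallest order that fits */

    g.nr_hash = (unsigned int)((1UL << g.order) * SK_PAGE_SIZE / SK_LIST_HEAD_SIZE);
    g.hash_mask = g.nr_hash - 1;
    g.hash_shift = 0;
    tmp = g.nr_hash;
    while ((tmp >>= 1) != 0)            /* count bits: log2(nr_hash) */
        g.hash_shift++;
    return g;
}
```

A 16 MB machine (4096 pages) gives order 1, i.e. 1024 heads with i_hash_shift 10 = 1 + 12 - 3. A 128 MB machine (32768 pages) gives order 4, i.e. 8192 heads with a shift of 13.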
static inline unsigned long hash(struct super_block *sb, unsigned long i_ino)
{
    unsigned long tmp = i_ino | ((unsigned long) sb / L1_CACHE_BYTES);
    tmp = tmp + (tmp >> I_HASHBITS) + (tmp >> I_HASHBITS*2);
    return tmp & I_HASHMASK;
}
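The same hash arithmetic can be exercised in user space by treating the superblock pointer as a plain address. The sketch assumes a 1024-entry table (I_HASHBITS = 10) and a 32-byte L1 cache line. Both are illustrative constants. The function is a copy of the formula above under those assumptions, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters: a 1024-entry table and a 32-byte L1 cache line. */
#define SK_L1_CACHE_BYTES 32UL
#define SK_HASHBITS       10
#define SK_HASHMASK       ((1UL << SK_HASHBITS) - 1)

/* Same formula as hash(): the superblock address is divided by the
 * cache-line size (its low bits vary little because of alignment) and
 * OR-ed with the inode number. Higher bits are then folded down twice,
 * and the result is masked to the table size. */
static unsigned long sk_hash(uintptr_t sb_addr, unsigned long i_ino)
{
    unsigned long tmp = i_ino | (unsigned long)(sb_addr / SK_L1_CACHE_BYTES);
    tmp = tmp + (tmp >> SK_HASHBITS) + (tmp >> SK_HASHBITS * 2);
    return tmp & SK_HASHMASK;
}
```

For a superblock at 0xc1000000 and inode number 2 the computation runs as follows:
- tmp = 0x06080002;
- adding tmp >> 10 and tmp >> 20 gives 0x06098262;
- masking with 0x3ff gives 0x262, so the inode lands in chain 610.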
• Hash table operations
The functions that operate on the inode hash table are insert_inode_hash(), remove_inode_hash() and find_inode().
insert_inode_hash() inserts an unhashed inode object into the appropriate hash chain:
void insert_inode_hash(struct inode *inode)
{
    struct list_head *head = &anon_hash_chain;
    if (inode->i_sb)
        head = inode_hashtable + hash(inode->i_sb, inode->i_ino);
    spin_lock(&inode_lock);
    list_add(&inode->i_hash, head);
    spin_unlock(&inode_lock);
}
remove_inode_hash() removes an inode object from the hash chain it belongs to:
void remove_inode_hash(struct inode *inode)
{
    spin_lock(&inode_lock);
    list_del(&inode->i_hash);
    INIT_LIST_HEAD(&inode->i_hash);
    spin_unlock(&inode_lock);
}
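The two functions above depend on the kernel's circular doubly linked list. Re-initialising i_hash after list_del() matters: later code (iput()) tests list_empty(&inode->i_hash) to decide whether an inode is still hashed.

The user-space sketch below re-implements a minimal list_head and mirrors the two functions. It makes three simplifications:
- the inode number modulo a tiny table size stands in for hash();
- the locking is omitted;
- toy_* names and NR_HASH are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal re-implementation of a circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;   /* insert right after head */
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

#define NR_HASH 8   /* hypothetical tiny table */

struct toy_inode {
    void *i_sb;
    unsigned long i_ino;
    struct list_head i_hash;
};

static struct list_head hashtable[NR_HASH];
static struct list_head anon_chain;

static void toy_hash_init(void)
{
    for (int i = 0; i < NR_HASH; i++)
        INIT_LIST_HEAD(&hashtable[i]);
    INIT_LIST_HEAD(&anon_chain);
}

/* Like insert_inode_hash(): inodes with no superblock go to anon_chain.
 * The kernel holds inode_lock around the list update. */
static void toy_insert_hash(struct toy_inode *inode)
{
    struct list_head *head = &anon_chain;
    if (inode->i_sb)
        head = &hashtable[inode->i_ino % NR_HASH];  /* stand-in for hash() */
    list_add(&inode->i_hash, head);
}

/* Like remove_inode_hash(): unlink, then re-init so it reads as unhashed. */
static void toy_remove_hash(struct toy_inode *inode)
{
    list_del(&inode->i_hash);
    INIT_LIST_HEAD(&inode->i_hash);
}
```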
find_inode() searches a given hash chain for the inode object uniquely identified by (sb, ino):
/*
* Called with the inode lock held.
* NOTE: we are not increasing the inode-refcount, you must call __iget()
* by hand after calling find_inode now! This simplifies iunique and won't
* add any additional branch in the common code.
*/
static struct inode * find_inode(struct super_block * sb, unsigned long ino, struct list_head *head, find_inode_t find_actor, void *opaque)
{
    struct list_head *tmp;
    struct inode * inode;

    tmp = head;
    for (;;) {
        tmp = tmp->next;
        inode = NULL;
        if (tmp == head)
            break;
        inode = list_entry(tmp, struct inode, i_hash);
        if (inode->i_ino != ino)
            continue;
        if (inode->i_sb != sb)
            continue;
        if (find_actor && !find_actor(inode, ino, opaque))
            continue;
        break;
    }
    return inode;
}
NOTE! find_inode() must be called with inode_lock held.
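The search loop in find_inode() has a fixed structure:
- walk one chain;
- skip entries whose i_ino or i_sb differ;
- optionally let find_actor() reject a candidate;
- return the first match without taking a reference.

The sketch below reproduces that loop in user space. The generation field and match_generation() actor are hypothetical, added only to show how an actor with an opaque argument narrows the match.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Recover the containing structure from an embedded list_head. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

struct toy_inode {
    const void *i_sb;
    unsigned long i_ino;
    int generation;            /* hypothetical extra key for the actor */
    struct list_head i_hash;
};

typedef int (*find_actor_t)(struct toy_inode *, unsigned long, void *);

/* Same loop shape as find_inode(); like the original, no reference is taken. */
static struct toy_inode *toy_find_inode(const void *sb, unsigned long ino,
                                        struct list_head *head,
                                        find_actor_t find_actor, void *opaque)
{
    for (struct list_head *tmp = head->next; tmp != head; tmp = tmp->next) {
        struct toy_inode *inode = list_entry(tmp, struct toy_inode, i_hash);
        if (inode->i_ino != ino || inode->i_sb != sb)
            continue;
        if (find_actor && !find_actor(inode, ino, opaque))
            continue;
        return inode;
    }
    return NULL;
}

/* Example actor: additionally require a matching generation number. */
static int match_generation(struct toy_inode *inode, unsigned long ino, void *opaque)
{
    (void)ino;
    return inode->generation == *(int *)opaque;
}
```

Because list_add() inserts at the front, the most recently added matching inode is found first when no actor is given.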

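1.4 The icache access interface: iget()/iput()

1.4.1 iget(): take a reference to an inode object
Other kernel code that wants to access an inode object must go through iget(). It first looks up the inode object in the icache hash table. If found, it increments the object's reference count and returns a pointer to it. Otherwise it allocates a new inode object by calling get_new_inode(). The source is (include/linux/fs.h):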
static inline struct inode *iget(struct super_block *sb, unsigned long ino)
{
    return iget4(sb, ino, NULL, NULL);
}
As the code shows, the real work is done by iget4() (iget with four arguments, hence the name). Its source (inode.c) is:
struct inode *iget4(struct super_block *sb, unsigned long ino, find_inode_t find_actor, void *opaque)
{
    struct list_head * head = inode_hashtable + hash(sb,ino);
    struct inode * inode;

    spin_lock(&inode_lock);
    inode = find_inode(sb, ino, head, find_actor, opaque);
    if (inode) {
        __iget(inode);
        spin_unlock(&inode_lock);
        wait_on_inode(inode);
        return inode;
    }
    spin_unlock(&inode_lock);

    /*
     * get_new_inode() will do the right thing, re-trying the search
     * in case it had to block at any point.
     */
    return get_new_inode(sb, ino, head, find_actor, opaque);
}
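NOTES:
(1) First, the hash function hash() selects the hash chain the inode object should be on, and find_inode() searches that chain for it.
(2) If the inode object (uniquely identified by the superblock pointer sb and the inode number ino) is found, iget4() does three things:
- it calls __iget() to increment the object's reference count;
- it calls wait_on_inode() to wait for any other kernel code that has the inode locked to unlock it;
- it returns the pointer to the inode object.
(3) If it is not found, get_new_inode() is called to allocate a new inode object.
The inline function __iget() increments an inode object's reference count. It first checks the current value of i_count. If i_count is nonzero, it is incremented and the function returns. If it is zero, it is incremented and the function then checks whether the inode is dirty:
- If it is dirty (i_state & I_DIRTY is nonzero), nothing else is done, and the inode stays on its superblock's s_dirty list.
- If it is clean, the inode is removed from its current "type" list (which should be inode_unused) and added to inode_in_use.

In both cases, the unused-inode count in the icache statistics is then decremented.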
static inline void __iget(struct inode * inode)
{
    if (atomic_read(&inode->i_count)) {
        atomic_inc(&inode->i_count);
        return;
    }
    atomic_inc(&inode->i_count);
    if (!(inode->i_state & I_DIRTY)) {
        list_del(&inode->i_list);
        list_add(&inode->i_list, &inode_in_use);
    }
    inodes_stat.nr_unused--;
}
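The transition rules of __iget() are easy to lose in the list manipulation. The sketch below keeps only the bookkeeping.
- Only the 0 → 1 transition does extra work.
- A clean inode moves from the unused list to the in-use list.
- A dirty one stays where it is.
- Either way, the unused count drops by one.

A plain int stands in for the atomic counter. The caller is assumed to hold the cache lock, as with inode_lock. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_list { IN_USE, UNUSED, SB_DIRTY };

struct toy_inode {
    int i_count;
    bool dirty;            /* stands for i_state & I_DIRTY */
    enum toy_list list;    /* which type list i_list is on */
};

static int nr_unused;      /* plays the role of inodes_stat.nr_unused */

/* Sketch of __iget(); the caller is assumed to hold the cache lock. */
static void toy_iget(struct toy_inode *inode)
{
    if (inode->i_count++ > 0)
        return;                    /* already in use: just count it */
    /* 0 -> 1: the inode stops being unused */
    if (!inode->dirty)
        inode->list = IN_USE;      /* clean: unused list -> in-use list */
    /* dirty: it stays on the superblock's s_dirty list */
    nr_unused--;
}
```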
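wait_on_inode() tests whether an inode object is locked (I_LOCK set in i_state). If so, it calls __wait_on_inode() to wait until the inode is unlocked; otherwise it returns immediately: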
static inline void wait_on_inode(struct inode *inode)
{
    if (inode->i_state & I_LOCK)
        __wait_on_inode(inode);
}
get_new_inode() obtains a new inode object. It starts by allocating a new inode object from the inode_cachep slab cache with the alloc_inode() macro:
static struct inode * get_new_inode(struct super_block *sb, unsigned long ino, struct list_head *head, find_inode_t find_actor, void *opaque)
{
    struct inode * inode;

    inode = alloc_inode();
    if (inode) {
        struct inode * old;

        spin_lock(&inode_lock);
        /* We released the lock, so.. */
        old = find_inode(sb, ino, head, find_actor, opaque);
        if (!old) {
            inodes_stat.nr_inodes++;
            list_add(&inode->i_list, &inode_in_use);
            list_add(&inode->i_hash, head);
            inode->i_sb = sb;
            inode->i_dev = sb->s_dev;
            inode->i_ino = ino;
            inode->i_flags = 0;
            atomic_set(&inode->i_count, 1);
            inode->i_state = I_LOCK;
            spin_unlock(&inode_lock);

            clean_inode(inode);
            sb->s_op->read_inode(inode);

            /*
             * This is special! We do not need the spinlock
             * when clearing I_LOCK, because we're guaranteed
             * that nobody else tries to do anything about the
             * state of the inode when it is locked, as we
             * just created it (so there can be no old holders
             * that haven't tested I_LOCK).
             */
            inode->i_state &= ~I_LOCK;
            wake_up(&inode->i_wait);

            return inode;
        }

        /*
         * Uhhuh, somebody else created the same inode under
         * us. Use the old inode instead of the one we just
         * allocated.
         */
        __iget(old);
        spin_unlock(&inode_lock);
        destroy_inode(inode);
        inode = old;
        wait_on_inode(inode);
    }
    return inode;
}
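(1) When get_new_inode() is called, the caller has already released inode_lock, and alloc_inode() may sleep. Between the release of inode_lock and the moment get_new_inode() reacquires it, other kernel code may have created the inode identified by (sb, ino). get_new_inode() therefore calls find_inode() again to check whether that inode object now exists in the hash chain.
(2) If find_inode() returns a valid pointer, some other code has already created the (sb, ino) inode object. get_new_inode() then does four things:
- it calls __iget() to take a reference on the inode it found;
- it drops inode_lock;
- it calls destroy_inode() to free the inode object it allocated at the start;
- it calls wait_on_inode() to wait until the found inode is unlocked.
(3) If find_inode() returns NULL, get_new_inode() sets up the inode object it allocated:
- it links the inode onto inode_in_use and onto the hash chain, and marks it I_LOCK;
- it calls clean_inode() and then the superblock's s_op->read_inode() method to read the on-disk inode from the block device;
- it clears I_LOCK, wakes up any waiters, and returns the pointer to the new inode object.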
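The iget4()/get_new_inode() pair follows a general pattern. The fast path looks up under the lock. On a miss, the new object is allocated without the lock (because allocation may sleep), then the lock is retaken and the lookup repeated before inserting. If someone else won the race, their object is used and ours is freed.

The sketch below shows that pattern in user space:
- a C11 atomic_flag spinlock stands in for inode_lock;
- a single singly linked chain stands in for the hash table;
- wait_on_inode() and I_LOCK have no counterpart;
- all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal spinlock on C11 atomic_flag, standing in for inode_lock. */
static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static void lock(void)   { while (atomic_flag_test_and_set(&table_lock)) { /* spin */ } }
static void unlock(void) { atomic_flag_clear(&table_lock); }

struct entry { unsigned long key; int refs; struct entry *next; };
static struct entry *table;     /* one chain is enough for the sketch */

static struct entry *lookup_locked(unsigned long key)
{
    for (struct entry *e = table; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

/* get_new_inode() pattern: allocate unlocked, re-check under the lock. */
static struct entry *get_new_entry(unsigned long key)
{
    struct entry *new = malloc(sizeof(*new));
    struct entry *old;

    if (!new)
        return NULL;
    lock();
    old = lookup_locked(key);           /* search again: we slept unlocked */
    if (!old) {
        new->key = key;
        new->refs = 1;
        new->next = table;
        table = new;                    /* publish the new object */
        unlock();
        return new;
    }
    old->refs++;                        /* lost the race: use theirs */
    unlock();
    free(new);                          /* and discard our copy */
    return old;
}

/* iget4() pattern: fast path under the lock, slow path otherwise. */
static struct entry *get_entry(unsigned long key)
{
    struct entry *e;

    lock();
    e = lookup_locked(key);
    if (e) {
        e->refs++;
        unlock();
        return e;
    }
    unlock();
    return get_new_entry(key);
}
```

Calling get_new_entry() directly for a key that already exists simulates losing the race: the existing entry gains a reference and the fresh allocation is freed.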

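1.4.2 iput(): drop a reference to an inode object
When other kernel objects or modules no longer need an inode object, they must release their reference to it with iput().
iput() decrements the inode object's reference count. If the count is still nonzero, the function simply returns. If the count reaches 0, the inode is either kept in the cache as unused or destroyed, as detailed below.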
/**
* iput - put an inode
* @inode: inode to put
*
* Puts an inode, dropping its usage count. If the inode use count hits
* zero the inode is also then freed and may be destroyed.
*/

void iput(struct inode *inode)
{
    if (inode) {
        struct super_operations *op = NULL;

        if (inode->i_sb && inode->i_sb->s_op)
            op = inode->i_sb->s_op;
        if (op && op->put_inode)
            op->put_inode(inode);

        if (!atomic_dec_and_lock(&inode->i_count, &inode_lock))
            return;

        if (!inode->i_nlink) {
            list_del(&inode->i_hash);
            INIT_LIST_HEAD(&inode->i_hash);
            list_del(&inode->i_list);
            INIT_LIST_HEAD(&inode->i_list);
            inode->i_state|=I_FREEING;
            inodes_stat.nr_inodes--;
            spin_unlock(&inode_lock);

            if (inode->i_data.nrpages)
                truncate_inode_pages(&inode->i_data, 0);

            if (op && op->delete_inode) {
                void (*delete)(struct inode *) = op->delete_inode;
                /* s_op->delete_inode internally recalls clear_inode() */
                delete(inode);
            } else
                clear_inode(inode);
            if (inode->i_state != I_CLEAR)
                BUG();
        } else {
            if (!list_empty(&inode->i_hash)) {
                if (!(inode->i_state & I_DIRTY)) {
                    list_del(&inode->i_list);
                    list_add(&inode->i_list,
                        &inode_unused);
                }
                inodes_stat.nr_unused++;
                spin_unlock(&inode_lock);
                return;
            } else {
                /* magic nfs path */
                list_del(&inode->i_list);
                INIT_LIST_HEAD(&inode->i_list);
                inode->i_state|=I_FREEING;
                inodes_stat.nr_inodes--;
                spin_unlock(&inode_lock);
                clear_inode(inode);
            }
        }
        destroy_inode(inode);
    }
}
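Notes on the function:
(1) The function first calls the superblock's s_op->put_inode() method, if one is defined. This happens on every iput(), before the reference count is decremented, so it is a notification hook rather than the point where the inode is freed.
(2) atomic_dec_and_lock() atomically decrements i_count. Only if the count reaches 0 does it acquire inode_lock and return true. If i_count is still nonzero after the decrement, iput() returns immediately; otherwise it continues as follows, holding inode_lock.
(3) If the inode's i_nlink is 0, no file in the filesystem links to this inode any more. iput() then proceeds in five steps:
- It removes the inode from its hash chain and its "type" list.
- It sets the I_FREEING flag in i_state to mark the inode as being freed, decrements nr_inodes and releases inode_lock.
- It discards any pages still in the inode's page cache with truncate_inode_pages().
- If the superblock method s_op->delete_inode() is defined, it calls that method to delete the corresponding on-disk inode; delete_inode() calls clear_inode() internally. Otherwise it calls clear_inode() directly.
- Finally, destroy_inode() returns the in-memory inode to the inode_cachep slab cache.
(4) If i_nlink is not 0, nobody references the inode object any more, but files in the filesystem still link to it, so it may be referenced again. iput() first checks whether the inode is on a hash chain.
- If it is hashed, the inode is moved to the inode_unused list, unless it is dirty, in which case it stays on s_dirty. nr_unused is incremented and iput() returns.
- If it is not hashed, there is no point in keeping it in the icache, because it cannot be found through the hash table anyway. iput() removes it from its "type" list, calls clear_inode() to clear it, and finally calls destroy_inode() to return the in-memory inode to the inode_cachep slab cache.

Note that the superblock's delete_inode() method must not be called in this case, because files in the filesystem still link to the inode.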
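The five outcomes of iput() described in the notes can be summarised as a pure decision function. The sketch below omits the put_inode() hook, the locking, the page truncation and the actual freeing. It keeps only the decision and the nr_unused bookkeeping. The enum names and struct toy_inode are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcomes of dropping a reference, following the iput() cases. */
enum iput_result {
    IPUT_STILL_USED,      /* i_count still > 0 after the decrement */
    IPUT_CACHED_UNUSED,   /* hashed and clean: moved to inode_unused */
    IPUT_CACHED_DIRTY,    /* hashed and dirty: left on sb->s_dirty */
    IPUT_DELETED,         /* i_nlink == 0: delete_inode()/clear_inode(), then freed */
    IPUT_DROPPED          /* unhashed: clear_inode(), then freed */
};

struct toy_inode {
    int i_count;
    unsigned int i_nlink;
    bool hashed;          /* !list_empty(&inode->i_hash) */
    bool dirty;           /* i_state & I_DIRTY */
};

static int nr_unused;     /* plays the role of inodes_stat.nr_unused */

static enum iput_result toy_iput(struct toy_inode *inode)
{
    if (--inode->i_count > 0)
        return IPUT_STILL_USED;
    if (inode->i_nlink == 0)
        return IPUT_DELETED;           /* no names left: remove on-disk inode */
    if (inode->hashed) {
        nr_unused++;                   /* counted as unused even when dirty */
        return inode->dirty ? IPUT_CACHED_DIRTY : IPUT_CACHED_UNUSED;
    }
    return IPUT_DROPPED;               /* cannot be found via the hash anyway */
}
```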

Source: lslnet

northwindrocker · posted 2004-9-3 15:53:18

Anything produced by probing is top quality!!!