
[C++] extremely fast counting of files in a directory

William

pr0
Verified Provider
I used this before for a dir with ~4 million files; this takes around 0.5s, while ls takes... well... no idea, it never finishes.


compile with "g++ count.cpp -o count && mv count /usr/sbin"

Usage: ./count "<path>"

Code:
#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        printf("Usage: ./count \"<path>\"\n");
        return 1;
    }

    struct dirent *de;
    DIR *dir = opendir(argv[1]);
    if(!dir)
    {
        printf("opendir() failed! Does it exist?\n");
        return 1;
    }

    unsigned long count=0;
    while((de = readdir(dir)))
    {
        ++count;
    }

    closedir(dir);
    printf("%lu\n", count);

    return 0;
}
 

fixidixi

Active Member
Just caught my eye: count is unsigned long. That's 0 to 4,294,967,295. Are there more files than that?

strace it?
 

Francisco

Company Lube
Verified Provider
Any programmer who decides it's OK to store several million files in a single directory shouldn't be allowed to touch a computer again.

You know, working on our backups node has been a real test of that very comment.

You'd be simply amazed how many million+ inode qmail queue folders we have on here.

Francisco
 

William

pr0
Verified Provider
Any programmer who decides it's OK to store several million files in a single directory shouldn't be allowed to touch a computer again.

Well, sort of; it's zero I/O concern for me - I don't count them often and *know* the filename of each file I need to copy/read (stored in a DB). It also saves me the steps of cutting characters and using subdirs.
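(For context, the "cutting characters and using subdirs" approach being avoided here is the usual sharding scheme: take the first couple of characters of the filename and use them as a subdirectory. A purely hypothetical sketch, with a made-up helper and paths, just to illustrate the idea:)

Code:
#include <stdio.h>

/* hypothetical helper: turn "abcdef.dat" into "base/ab/abcdef.dat"
   by sharding on the first two characters of the filename */
static void sharded_path(char *out, size_t outlen,
                         const char *base, const char *name)
{
    snprintf(out, outlen, "%s/%.2s/%s", base, name, name);
}

int main(void)
{
    char path[4096];
    sharded_path(path, sizeof(path), "/var/data", "abcdef.dat");
    printf("%s\n", path);   /* prints /var/data/ab/abcdef.dat */
    return 0;
}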

@William care to tell us if you could solve this? :)
No, I didn't write it - I got it, including copyleft, from a friend.
 

raindog308

vpsBoard Premium Member
Moderator
#!/usr/bin/perl
opendir(D, $ARGV[0]) || die;
my $count = 0;
# defined() guards against a file literally named "0" ending the loop early
while (defined($file = readdir(D))) { $count++; }
print $count . "\n";

Code:
$ ./count.pl /tmp
146
There are other ways of doing this...see http://www.perlmonks.org/?node_id=606766

Technically, there are some limitations (which are also present in the C++ code):

  • You'll probably count '.' and '..'
  • You're not descending recursively, nor testing that the directory entry you're counting is a file, directory, link, pipe, etc.
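A minimal sketch of how one might address both points in the C version - skipping '.' and '..' and counting only regular files. It assumes Linux/glibc, where struct dirent exposes d_type; entries reported as DT_UNKNOWN fall back to lstat(), which will of course cost extra syscalls on a multi-million-file directory.

Code:
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        printf("Usage: ./count \"<path>\"\n");
        return 1;
    }

    DIR *dir = opendir(argv[1]);
    if(!dir)
    {
        printf("opendir() failed! Does it exist?\n");
        return 1;
    }

    unsigned long count = 0;
    struct dirent *de;
    while((de = readdir(dir)))
    {
        /* skip the '.' and '..' entries */
        if(!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;

        if(de->d_type == DT_REG)
        {
            ++count;                       /* a regular file */
        }
        else if(de->d_type == DT_UNKNOWN)  /* filesystem gave no type */
        {
            char path[4096];
            struct stat st;
            snprintf(path, sizeof(path), "%s/%s", argv[1], de->d_name);
            if(lstat(path, &st) == 0 && S_ISREG(st.st_mode))
                ++count;
        }
    }

    closedir(dir);
    printf("%lu\n", count);
    return 0;
}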

I'd be curious to know what the speed difference is between Perl and C.  I don't have a directory with 4 million files lying around, though.

BTW, your code is technically C++ because C++ is (mostly) a superset of C, but it's really just C.
 

raindog308

vpsBoard Premium Member
Moderator
Just caught my eye: count is unsigned long. That's 0 to 4,294,967,295. Are there more files than that?

strace it?
unsigned long is platform dependent.  The standard only guarantees it is at least 32 bits.

long long, on the other hand, is guaranteed to be at least 64 bits on all platforms.  However, the only guaranteed range is -(2^63-1) to 2^63-1.  Stupid.

If you know that you are running on a 64-bit (LP64, e.g. 64-bit Linux) platform, then unsigned long will get you to 18,446,744,073,709,551,615.  Even @Francisco doesn't have 18 quintillion files in one directory.  Though if he does, I bet they're all animated .GIFs and viewing that directory in Windows Explorer would cause a singularity of some sort.

http://en.wikipedia.org/wiki/Long_integer#Common_long_integer_sizes
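If in doubt about what a given box actually gives you, a throwaway check like this (just a sketch) prints the widths:

Code:
#include <stdio.h>

int main(void)
{
    /* print how wide each candidate counter type is on this platform */
    printf("unsigned long:      %zu bytes\n", sizeof(unsigned long));
    printf("unsigned long long: %zu bytes\n", sizeof(unsigned long long));
    return 0;
}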
 

William

pr0
Verified Provider
I tested it quickly:

Your code takes a median of 0m0.728s (out of 10 tests with variance of up to .020)

The C++ takes a median of 0m0.254s (out of 10 tests with a variance of up to .003, an extremely stable result)

If you know that you are running on a 64-bit platform, then unsigned long will get you to 18,446,744,073,709,551,615.
Good, I only use 64-bit anyway :)
 

raindog308

vpsBoard Premium Member
Moderator
I tested it quickly:

Your code takes a median of 0m0.728s (out of 10 tests with variance of up to .020)

The C++ takes a median of 0m0.254s (out of 10 tests with a variance of up to .003, an extremely stable result)
That is about what I'd expect.  A lot of that ~0.5s difference is probably the Perl interpreter starting up.  I suspect that if the count were run multiple times in the same script, subsequent calls would be faster.
 

kaniini

Beware the bunny-rabbit!
Verified Provider
To be on the safe side, you might want to #include <stdint.h> and use uint64_t.
Actually, you should use size_t per C99; size_t is meant to be the widest type intended for sizes of allocations (in multiples of 1 or more).  So, size_t should be used.
 
Actually, you should use size_t per C99; size_t is meant to be the widest type intended for sizes of allocations (in multiples of 1 or more).  So, size_t should be used.
Good point. For the purpose of this program, size_t is better suited, even if one cannot tell the size of the variable without knowing the architecture it's going to run on.
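For reference, a minimal sketch of the same counter with a size_t count, printed via the C99 %zu specifier - otherwise identical to William's version:

Code:
#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        printf("Usage: ./count \"<path>\"\n");
        return 1;
    }

    DIR *dir = opendir(argv[1]);
    if(!dir)
    {
        printf("opendir() failed! Does it exist?\n");
        return 1;
    }

    /* size_t counter instead of unsigned long */
    size_t count = 0;
    while(readdir(dir))
        ++count;

    closedir(dir);
    printf("%zu\n", count);
    return 0;
}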
 