# [C++] extreme fast counting of files in directory



## William (Mar 12, 2014)

I used this before for a dir with ~4 million files; it takes around 0.5s, while ls takes... well... no idea, it never finishes.


compile with "g++ count.cpp -o count && mv count /usr/sbin"

Usage: ./count "<path>"


```
#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        printf("Usage: ./count \"<path>\"\n");
        return 1;
    }

    struct dirent *de;
    DIR *dir = opendir(argv[1]);
    if(!dir)
    {
        printf("opendir() failed! Does it exist?\n");
        return 1;
    }

    /* count every entry, including "." and ".." */
    unsigned long count = 0;
    while((de = readdir(dir)) != NULL)
    {
        ++count;
    }

    closedir(dir);
    printf("%lu\n", count);

    return 0;
}
```


----------



## fixidixi (Mar 12, 2014)

Just caught my eye: count is unsigned long, that's 0 to 4,294,967,295. Are there more files than that?

strace it?


----------



## peterw (Mar 12, 2014)

I can wait 0.5 seconds: ls -a | wc -l


----------



## HostUS-Alexander (Mar 12, 2014)

Nice share.

- Alexander


----------



## William (Mar 12, 2014)

peterw said:


> I can wait 0.5 seconds: ls -a | wc -l


Yea, for your 100 files maybe 

```
[email protected]:/# time count /home/db/
4183976

real    0m0.254s

[email protected]:/# time ls /home/db/ | wc -l
4183974

real    0m14.002s
```


----------



## rds100 (Mar 12, 2014)

Any programmer who decides it's OK to store several million files in a single directory shouldn't be allowed to touch a computer again.

.


----------



## Francisco (Mar 12, 2014)

rds100 said:


> Any programmer who decides it's OK to store several million files in a single directory shouldn't be allowed to touch a computer again.
> 
> .


You know, doing our backups node has been a real test of that very comment.

You'd be simply amazed how many million+ inode qmail queue folders we have on here.

Francisco


----------



## fixidixi (Mar 12, 2014)

@William care to tell us if you could solve this?


----------



## William (Mar 12, 2014)

rds100 said:


> Any programmer who decides it's OK to store several million files in a single directory shouldn't be allowed to touch a computer again.
> 
> .


Well, sort of - it's zero IO concern for me: I don't count them often, and I *know* the filename of each file I need to copy/read (stored in a DB). It also saves me the steps of cutting characters and using subdirs.



fixidixi said:


> @William care to tell us if you could solve this?


No, I didn't write it - I got it, copyleft included, from a friend.


----------



## raindog308 (Mar 12, 2014)

```
#!/usr/bin/perl
opendir (D,$ARGV[0]) || die;
while ($file=readdir(D)) { $count++; }
print $count . "\n";
```


```
$ ./count.pl /tmp
146
```
There are other ways of doing this...see http://www.perlmonks.org/?node_id=606766

Technically, there are some limitations (which are also present in the C++ code):


- You'll probably count '.' and '..'
- You're not descending recursively, nor testing whether the directory entry you're counting is a file, directory, link, pipe, etc.

I'd be curious to know what the speed difference is between perl and C.  I don't have a directory with 4 million files laying around though.

BTW, your code is only "C++" because C++ accepts (almost all of) C - it's really just C.


----------



## raindog308 (Mar 12, 2014)

fixidixi said:


> just caught my eyes: count is unsigned long. thats 0 to 4 294 967 295 are there more than that much files?
> 
> strace it?


unsigned long is platform dependent.  The standard only guarantees at least 32 bits, so on some platforms it's 32-bit.

long long, on the other hand, is guaranteed to be at least 64 bits on all platforms - unsigned long long gets you at least 0 to 2^64-1.

If you _know_ that you are running on a 64-bit platform, then unsigned long will get you to 18,446,744,073,709,551,615.  Even @Francisco doesn't have 18 quintillion files in one directory.  Though if he does, I bet they're all animated .GIFs and viewing that directory in Windows Explorer would cause a singularity of some sort.

http://en.wikipedia.org/wiki/Long_integer#Common_long_integer_sizes


----------



## William (Mar 12, 2014)

I tested it quickly:

Your code takes a median of 0m0.728s (out of 10 tests, with variance of up to .020)

The C++ takes a median of 0m0.254s (out of 10 tests, with variance of up to .003 - an extremely stable result)



> If you _know_ that you are running on a 64-bit platform, then unsigned long will get you to 18,446,744,073,709,551,615.


Good, i only use 64Bit anyway


----------



## qrwteyrutiyoup (Mar 12, 2014)

To be on the safe side, you might want to

```
#include <stdint.h>
```

and use

```
uint64_t
```


----------



## Wintereise (Mar 12, 2014)

qrwteyrutiyoup said:


> To be on the safe side, you might want to
> 
> 
> #include <stdint.h>
> ...


+1.


----------



## raindog308 (Mar 12, 2014)

William said:


> I tested it quickly:
> 
> Your code takes a median of 0m0.728s (out of 10 tests, with variance of up to .020)
> 
> The C++ takes a median of 0m0.254s (out of 10 tests, with variance of up to .003 - an extremely stable result)


That is about what I'd expect.  A lot of that ~0.5s difference is probably perl interpreter start-up.  I suspect if that count was run multiple times in the same script, subsequent calls would be faster.


----------



## kaniini (Mar 14, 2014)

qrwteyrutiyoup said:


> To be on the safe side, you might want to
> 
> 
> #include <stdint.h>
> ...


Actually, per C99 you should use size_t - it's the unsigned type meant to represent the size of any allocation, and so any count of objects. So, size_t should be used.


----------



## qrwteyrutiyoup (Mar 15, 2014)

kaniini said:


> Actually, per C99 you should use size_t - it's the unsigned type meant to represent the size of any allocation, and so any count of objects. So, size_t should be used.


Good point. For the purpose of this program size_t is better suited, even if one cannot tell the size of the variable without knowing the architecture it's going to run on.


----------

